Cancers
  • Article
  • Open Access

22 December 2025

Timepoint-Specific Benchmarking of Deep Learning Models for Glioblastoma Follow-Up MRI

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
* Author to whom correspondence should be addressed.
This article belongs to the Section Methods and Technologies Development

Simple Summary

Glioblastoma is an aggressive brain cancer, and follow-up MRI scans are used to determine whether changes after treatment represent real tumor growth or temporary treatment effects. This decision is difficult, especially at the first follow-up. We analyzed 180 patients and compared eleven deep learning models across two follow-up timepoints. Overall accuracy was similar at both timepoints, ranging from about 70% to 74%. However, the second follow-up provided clearer separation between the three clinical outcomes, with the best model improving its F1 score from 0.44 at the first follow-up to 0.53 at the second. A model that combines convolutional features with a state-space sequence method consistently gave the best balance of accuracy and efficiency, while some transformer models reached higher AUC values but required much more computation. These findings offer a practical benchmark to guide future research and clinical tool development.

Abstract

Background: Differentiating true tumor progression (TP) from treatment-related pseudoprogression (PsP) in glioblastoma remains challenging, especially at early follow-up. Methods: We present the first timepoint-specific, cross-sectional benchmarking of deep learning models for follow-up MRI using the Burdenko GBM Progression cohort (n = 180). We analyze different post-RT scans independently to test whether architecture performance depends on timepoint. Eleven representative DL families (CNNs, LSTMs, hybrids, transformers, and selective state-space models) were trained under a unified, QC-driven pipeline with patient-level cross-validation. Results: Across both timepoints, accuracies were comparable (~0.70–0.74), but discrimination improved at the second follow-up, with F1 and AUC increasing for several models, indicating richer separability later in the care pathway. A Mamba+CNN hybrid consistently offered the best accuracy–efficiency trade-off, while transformer variants delivered competitive AUCs at substantially higher computational cost, and lightweight CNNs were efficient but less reliable. Performance also showed sensitivity to batch size, underscoring the need for standardized training protocols. Notably, absolute discrimination remained modest overall, reflecting the intrinsic difficulty of TP vs. PsP and the dataset’s size and imbalance. Conclusions: These results establish a timepoint-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.

1. Introduction

Gliomas are a heterogeneous group of tumors that originate from glial cells in the central nervous system and are among the most common types of brain tumors. They account for approximately 30% of all brain tumors and about 80% of malignant brain tumors. Gliomas can be classified into different types, including astrocytomas, oligodendrogliomas, and glioblastomas, with glioblastoma being the most aggressive and lethal form. The prognosis for patients with glioblastoma remains poor, with a median survival of approximately 15 months despite advances in treatment. The complexity and variability of gliomas, coupled with their infiltrative nature, present significant challenges in diagnosis and treatment, underscoring the need for innovative approaches to improve patient outcomes. A critical challenge in the treatment of brain tumors is differentiating between true tumor progression (TP) and treatment effect (TE), such as pseudoprogression (PsP) and radiation necrosis. TP refers to the actual growth or recurrence of the tumor, indicating a need for a change in treatment strategy to more aggressive or alternative therapies. In contrast, PsP and radiation necrosis are phenomena related to treatment effects and often managed by close monitoring rather than aggressive treatment. Because their imaging appearances can closely mimic true progression, accurate differentiation is difficult and can complicate clinical decision-making.
Machine learning (ML) and deep learning (DL) methods have been increasingly applied to address this challenge. Hu et al. applied a support vector machine on multiparametric MRI, comprising diffusion and perfusion parameters [1]. While the study was limited by a relatively small cohort, their voxel-wise classification approach demonstrated that the integration of advanced MRI modalities enhances discriminative ability. Radiomics-based approaches have also shown promise: a random forest classifier trained on T1C images outperformed neuroradiologists [2], gradient boosting on pre-treatment T1C radiomics demonstrated strong accuracy in external validation [3], and even low-parameter supervised methods yielded moderate AUCs on limited datasets [4]. Building on these early efforts, deep learning frameworks have achieved significant progress. Early studies combined MRI with clinical data in CNN–LSTM frameworks, achieving improved predictive accuracy over unimodal methods [5]. Li et al. proposed a DCGAN–AlexNet framework to mitigate overfitting in small datasets [6], while CNN–LSTM models treating multiparametric MRI as spatial sequences further boosted accuracy and AUC [7]. Automated DL methods, such as 3D DenseNet-121 applied to T2 and CET1 MRI [8], and radiomics classifiers validated on biopsy-proven gliomas [9], confirmed the feasibility of noninvasive PsP–TP prediction. Recent work has moved toward advanced multimodal and transformer-based approaches. Integration of imaging features with MGMT promoter methylation achieved high diagnostic performance [10,11]. Multiparametric MRI combined with MGMT status enhanced PsP–TP classification [12]. CFINet, an attention-based cross-modality feature interaction network leveraging T1- and T2-weighted MRI, demonstrated strong generalization on independent cohorts by effectively capturing complementary information across modalities [13]. Self-supervised multimodal ViTs leveraging contrastive and context-restoration pretraining further improved performance [14]. Nonetheless, most methods remain constrained to single timepoints and heterogeneous datasets, limiting reproducibility and generalizability.
One of the main challenges in ML/DL in this setting is the small sample size and resulting class imbalance due to the low prevalence of TP and PsP. Several studies have used oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) [15]. Two-network architectures with similarity loss improved robustness under limited-data conditions [16] by training two identical networks in parallel with independently undersampled k-space inputs at different reduction factors. A cosine similarity-based loss combined with L1 loss enforced the networks to learn both common and distinct features, while exponential moving average (EMA) updating of the target network further stabilized training. Methodological cautions highlight that oversampling must be confined to training sets to avoid data leakage [17]. More advanced methods, such as OCH-SMOTE [18], DR-SMOTE [19], and hybrid BSGAN [20], further expanded applicability. Taken together, these findings underscore the importance of class balancing strategies in ensuring robust and generalizable ML/DL pipelines for PsP–TP classification.
Another challenge is how different DL architectures perform across longitudinal follow-ups. Multiple studies have shown that pseudoprogression tends to occur early after radiotherapy; for example, up to 50% of malignant glioma patients show evidence of pseudoprogression on MRI immediately after chemo-radiotherapy [21], and about 60% of PsP cases are reported within the first 3 months post-radiotherapy [22]. By contrast, true tumor progression generally emerges later, and MRI changes beyond 3–6 months are more likely to reflect actual recurrence than transient treatment effects [23]. No prior study has systematically examined whether specific architectures are better suited for early versus late follow-ups—a question with direct clinical relevance.
In this work, we address this critical gap using the Burdenko Glioblastoma Progression Dataset [24], which comprises 180 patients, is substantially larger than previously reported cohorts, and provides longitudinal T1C MRI follow-ups. Table 1 summarizes prior studies, where most datasets ranged between 30 and 130 patients. We preprocessed raw MRI data, mitigated class imbalance through an autoencoder (AE) hybrid with SMOTE and augmentation strategies, and conducted a systematic comparison of deep learning techniques to determine which follow-up timepoint offers greater diagnostic utility and which architectures perform most effectively at each timepoint. We benchmarked CNNs, LSTMs, CNN–LSTM hybrids, ResNets, attention-augmented CNNs, Vision Transformers (2D/3D), Swin Transformers, and Mamba-based hybrids across two clinically relevant timepoints: (1) the first follow-up after radiation therapy and before adjuvant chemotherapy and (2) the second follow-up after combined chemo-radiotherapy. By harmonizing preprocessing protocols and employing patient-level cross-validation, we present the first timepoint-specific comparative evaluation of modern deep learning models in longitudinal glioma imaging. Our findings establish a rigorous benchmark that not only guides methodological choices for machine learning researchers but also delivers translational insights for neuro-oncology, underscoring the importance of timepoint-adaptive modeling strategies in clinical decision support.
Table 1. Summary of prior imaging-based machine and deep learning studies for differentiating true progression (TP) from pseudoprogression (PsP) in glioblastoma.

2. Methods

We systematically developed and benchmarked a diverse set of architectures to interrogate post-treatment glioblastoma MRI, spanning conventional 3D CNNs and ResNets, sequential models (LSTM-based), transformer variants (2D/3D-ViT, Swin Transformer), Mamba models, and novel hybrids that combine CNNs with attention, LSTM, shifted-window patching, or Mamba state-space modules. The detailed architectures of these models are provided in the Supplementary Materials.
To systematically investigate how model architecture and clinical timepoint affect predictive performance, we developed a standardized pipeline encompassing dataset curation, preprocessing, model implementation, training/validation, and evaluation. The workflow was designed to ensure reproducibility, prevent information leakage, and provide a fair basis for comparison across architectures.

2.1. Dataset and Preprocessing

We used the Burdenko Glioblastoma Progression Dataset (Burdenko-GBM-Progression), https://www.cancerimagingarchive.net/collection/burdenko-gbm-progression/ (accessed on 2 December 2025; archived version: https://web.archive.org/web/20240101000000/https://www.cancerimagingarchive.net/collection/burdenko-gbm-progression/), comprising 180 patients with primary glioblastoma treated between 2014 and 2020. The dataset includes multi-sequence MRI, CT, clinical, and molecular data; for this study, we focused exclusively on contrast-enhanced T1-weighted MRI (T1C), the most widely available and clinically relevant sequence for follow-up assessment. Imaging was acquired on scanners from four vendors with heterogeneous protocols, reflecting real-world clinical variability. Notably, the dataset documentation does not report scanner field strengths. The dataset is publicly available through The Cancer Imaging Archive (TCIA).
All T1C volumes were standardized through a multi-timepoint pipeline (Figure 1). First, raw DICOM series were converted to NIfTI format using dicom2nifti 2.6.1 and subjected to quality control. Implausible acquisitions were excluded using geometry thresholds on voxel spacing, anisotropy, and slice thickness; among valid candidates, the volume with the highest clarity score was selected per timepoint. Logs and slice previews were generated for auditability. Retained volumes were resampled to 128³ voxels, skull-stripped using BET2 (FSL 6.0.7.17, Oxford, UK), rigidly registered to MNI152 space with the “MNI152_T1_1mm.nii.gz” template provided in FSL 6.0.7.17, and Z-score normalized within the brain mask to ensure cross-subject consistency. Deterministic image–label pairing ensured reproducibility at the patient level. For training, data imbalance was mitigated through oversampling and augmentation. Synthetic 3D MR volumes were generated using a lightweight autoencoder with latent-space interpolation, and mild perturbations—small affine transforms (in-plane rotations within ±3° and scaling between 0.98 and 1.02) and Gaussian noise (zero-mean, σ ≤ 0.02)—were applied to both real and synthetic data. After augmentation, 1611 samples were available at the first follow-up and 1431 samples at the second follow-up. Full implementation details, including thresholds and parameter settings, are provided in the Supplementary Methods.
Figure 1. Preprocessing workflow. (a) Whole-set pipeline from raw T1C identification to standardized outputs. (b) High-quality selection logic per patient–timepoint with explicit geometry thresholds and clarity-score ranking; per-series CSV logging and preview generation. (c) Training-only oversampling and augmentation: 3D AE latent-space SMOTE with decoding to 3D volumes, followed by mild augmentations; validation/test splits remain unmodified.
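As a concrete illustration of these training-only balancing steps, the sketch below mirrors the in-mask Z-score normalization, the mild perturbations, and SMOTE-style interpolation in an autoencoder latent space. It is a minimal sketch, not the study's implementation: the autoencoder interface (ae.encode/ae.decode), latent width, and sample counts are assumptions for illustration.

```python
import numpy as np
import torch

def zscore_in_mask(vol: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Z-score normalize intensities within the brain mask only."""
    voxels = vol[mask > 0]
    out = vol.copy()
    out[mask > 0] = (voxels - voxels.mean()) / (voxels.std() + 1e-8)
    return out

def mild_augment(vol: torch.Tensor) -> torch.Tensor:
    """Mild perturbation: zero-mean Gaussian noise with sigma <= 0.02
    (the small affine jitter described above is omitted for brevity)."""
    sigma = np.random.uniform(0.0, 0.02)
    return vol + sigma * torch.randn_like(vol)

def latent_smote(ae, minority: torch.Tensor, k: int = 3, n_new: int = 10) -> torch.Tensor:
    """SMOTE-style oversampling in a learned latent space (paper: 256-D, k = 3).
    `ae` is a hypothetical trained autoencoder with encode/decode; assumes len(minority) > k."""
    with torch.no_grad():
        z = ae.encode(minority)                                                # (N, 256) latent codes
        nn_idx = torch.cdist(z, z).topk(k + 1, largest=False).indices[:, 1:]   # k neighbors, self dropped
        synth = []
        for _ in range(n_new):
            i = torch.randint(len(z), (1,)).item()
            j = nn_idx[i, torch.randint(k, (1,)).item()]
            lam = torch.rand(1).item()
            synth.append(z[i] + lam * (z[j] - z[i]))                           # interpolate toward a neighbor
        return ae.decode(torch.stack(synth))                                   # decode back to 3D volumes
```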

2.2. Labeling Strategy

In the Burdenko dataset, each follow-up visit is annotated with a clinical label, which can vary over time. To create a single, reproducible outcome label per patient, we developed a consolidation framework in consultation with a neuro-oncologist. The rules were as follows:
Progression override: Any evidence of true progression at any timepoint defined the patient as Progression, irrespective of prior or subsequent labels, given its clinical impact.
Pseudoprogression: A patient was labeled PsP if the most recent follow-up indicated PsP and no prior imaging confirmed progression. PsP was also retained if the initial PsP was followed by stability or response without subsequent progression.
Stable disease: Patients with 3 consecutive follow-ups showing only stability or response, without new progression, were labeled Stable.
Scarce follow-up: For patients with 2 assessments, the most recent report determined the label (PsP if the last scan was PsP; otherwise, Progression).
Distant progression: Cases with new lesions outside the primary site were immediately classified as Progression.
This systematic framework ensured that final labels were clinically meaningful, consistent across patients, and reproducible for downstream modeling. Accordingly, the final consolidated outcome categories consisted of three labels: Progression, Pseudoprogression, and Stable.
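The five rules above amount to a small decision procedure. The following Python sketch is an illustrative rendering of that logic; the per-visit label strings and the chronological `visits` list are hypothetical stand-ins for the dataset's annotation schema.

```python
def consolidate_label(visits: list[str]) -> str:
    """Collapse chronological per-visit labels into one patient-level outcome (Section 2.2 rules)."""
    # Progression override / distant progression: any true progression wins.
    if any(v in ("progression", "distant_progression") for v in visits):
        return "Progression"
    # Scarce follow-up: with only two assessments, the most recent report decides.
    if len(visits) <= 2:
        return "Pseudoprogression" if visits[-1] == "pseudoprogression" else "Progression"
    # Pseudoprogression: latest report is PsP, or PsP followed only by stability/response.
    if visits[-1] == "pseudoprogression":
        return "Pseudoprogression"
    if "pseudoprogression" in visits and all(
        v in ("stable", "response") for v in visits[visits.index("pseudoprogression") + 1:]
    ):
        return "Pseudoprogression"
    # Stable disease: three consecutive follow-ups with only stability/response.
    if all(v in ("stable", "response") for v in visits[-3:]):
        return "Stable"
    return "Stable"  # remaining stability-only histories

print(consolidate_label(["pseudoprogression", "stable", "response"]))  # -> Pseudoprogression
```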

2.3. Model Architectures

We benchmarked eleven representative DL architectures spanning three design families: (1) Base volumetric models: 3D CNN, LSTM, 3D-ViT, and ResNet. These capture spatial or sequential dependencies directly from T1C volumes. (2) Hybrid models: CNN+LSTM, CNN+SE Attention, and 2D-ViT+LSTM. These combine local convolutional features with sequential or attention-based mechanisms. (3) Advanced models: 3D Swin Transformer, Swin CNN, 2D-Mamba, and 2D-Mamba+CNN, which incorporate hierarchical attention or state-space modeling for efficient long-range dependency capture. Architectures were chosen to represent the spectrum of contemporary strategies: convolutional inductive biases, recurrent modeling of slice order, attention-based global context, and state-space efficiency. All models were trained under the same preprocessing, augmentation, and patient-level cross-validation pipeline, ensuring that observed performance differences reflected architecture rather than implementation. The architectures of the base volumetric models and hybrid models are provided in Supplementary Figures S1–S7. The details of the advanced models are described as follows:
Shift Windows Transformer (Swin Transformer)—To reduce the quadratic cost of global attention, the Swin Transformer introduces windowed self-attention with shifted windows and a hierarchical (pyramidal) design, yielding strong accuracy–efficiency trade-offs. For volumetric MRI, the input T1C volume (1 × 128 × 128 × 128) was partitioned into non-overlapping 4 × 4 × 4 patches and linearly embedded, as shown in Figure 2a. The token grid was processed by four 3D Swin stages (W-MSA/SW-MSA + MLP with residual connections and layer normalization). Channel widths increased 48 → 96 → 192 → 384, with patch merging down-sampling the resolution 32³ → 16³ → 8³ → 4³, enabling cross-window interaction while controlling compute. The final feature map was aggregated by AdaptiveAvgPool3D (1, 1, 1), projected to 128 dimensions, and passed to a linear–SoftMax classifier (128 → 3) to yield class probabilities. This formulation captures long-range volumetric dependencies efficiently while preserving locality through windowed attention.
Figure 2. The architecture of the developed Swin-based models: (a) 3D-Swin Transformer and (b) Swin CNN (architectures).
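To make the volumetric tokenization concrete, the following minimal PyTorch sketch implements the 4 × 4 × 4 patch partition with linear embedding and the pooled classifier head described above. The four Swin stages (W-MSA/SW-MSA blocks and patch merging) are omitted, and module names are illustrative rather than the study's code.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Non-overlapping 4x4x4 patch partition + linear embedding (a Conv3d with stride = kernel size)."""
    def __init__(self, in_ch: int = 1, embed_dim: int = 48):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, 48, 32, 32, 32) for a 128^3 input
        x = x.flatten(2).transpose(1, 2)      # (B, 32^3, 48) token grid
        return self.norm(x)

class PooledHead3D(nn.Module):
    """AdaptiveAvgPool3d(1) -> 128-D projection -> linear 3-class logits (SoftMax lives in the loss)."""
    def __init__(self, in_ch: int = 384, n_classes: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(in_ch, 128), nn.Linear(128, n_classes))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(feat))       # feat: (B, 384, 4, 4, 4) after the last stage

print(PatchEmbed3D()(torch.randn(1, 1, 128, 128, 128)).shape)  # torch.Size([1, 32768, 48])
```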
Shift Window Patching CNN (Swin CNN)—Despite recent transformer variants, convolutional networks remain highly efficient for volumetric data because shared 3 × 3 × 3 kernels capture local structures with modest parameters. We therefore adopted a 3D CNN that borrows the window-shift idea from Swin: the input T1C volume (1 × 128 × 128 × 128) was first processed by two Conv3D–BatchNorm–ReLU blocks (kernel 3 × 3 × 3, padding 1) to extract low-level features, as shown in Figure 2b. The resulting feature map was then partitioned into fixed-size 3D windows, within which an additional per-window Conv3D–BN–ReLU was applied. To allow limited cross-window interaction while preserving locality, a second pass with shifted windows (offset by half the window size) could be performed; in practice, this yielded stable performance without materially increasing memory. Window-local convolutions focused capacity on fine anatomical details and constrained computation/memory, while the initial global blocks provided broader context. Features were aggregated (global pooling to a 128-D vector) and fed to a linear–SoftMax head (128 → 3) to produce class probabilities.
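A minimal sketch of this window-local convolution with an optional shifted pass follows. The window size and the reuse of one shared block for both passes are simplifying assumptions; windows are folded into the batch dimension so the shared kernel processes each window independently, and torch.roll implements the half-window shift.

```python
import torch
import torch.nn as nn

class WindowConv3D(nn.Module):
    """Per-window Conv3D-BN-ReLU with an optional Swin-style shifted second pass.
    Assumes D, H, W are divisible by the window size `win` (illustrative: 8)."""
    def __init__(self, ch: int, win: int = 8):
        super().__init__()
        self.win = win
        self.block = nn.Sequential(nn.Conv3d(ch, ch, 3, padding=1), nn.BatchNorm3d(ch), nn.ReLU())

    def _window_conv(self, x: torch.Tensor) -> torch.Tensor:
        B, C, D, H, W = x.shape
        w = self.win
        # Partition into w^3 windows and fold them into the batch dimension,
        # so the convolution sees each window in isolation (no cross-window context).
        x = x.view(B, C, D // w, w, H // w, w, W // w, w)
        x = x.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(-1, C, w, w, w)
        x = self.block(x)
        x = x.view(B, D // w, H // w, W // w, C, w, w, w)
        return x.permute(0, 4, 1, 5, 2, 6, 3, 7).reshape(B, C, D, H, W)

    def forward(self, x: torch.Tensor, shifted: bool = True) -> torch.Tensor:
        x = self._window_conv(x)
        if shifted:  # second pass on a half-window-rolled volume for limited cross-window mixing
            s = self.win // 2
            x = torch.roll(x, shifts=(s, s, s), dims=(2, 3, 4))
            x = self._window_conv(x)
            x = torch.roll(x, shifts=(-s, -s, -s), dims=(2, 3, 4))
        return x
```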
Two-Dimensional Linear-Time Sequence Modeling with Selective State Spaces (2D-Mamba)—Mamba is a selective state-space model (SSM) introduced for long-sequence modeling with linear-time complexity; vision adaptations (often called Vision Mamba) map 2D grids to scan orders processed by SSM blocks, offering competitive accuracy with fewer parameters than attention-based transformers. As shown in Figure 3a, each preprocessed T1-contrast (T1C) volume was represented as an ordered set of 50 axial slices. Slices were pad/crop-normalized to 128 × 128, resized to 224 × 224, and channel-expanded (grayscale replicated) to match ImageNet-style pretraining. A Vision-Mamba tiny backbone produced a 384-dimensional embedding per slice, yielding a sequence of shape 50 × 384. Sequence features were mean-pooled to a volume-level descriptor, projected to 128 dimensions, and passed to a linear–SoftMax classifier to output three class probabilities. By using SSM blocks to capture long-range dependencies within slices while aggregating across slices with lightweight pooling, this model retains global contextual modeling with favorable computational efficiency.
Figure 3. The architecture of the developed Mamba-based models: (a) 2D Mamba and (b) 2D Mamba+CNN architectures. The asterisk (*) denotes the beginning token of the entire sequence.
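The 2D-Mamba slice-sequence pipeline can be sketched as a thin wrapper around any 2D backbone that returns a 384-D embedding per image (here a placeholder for the pretrained Vision-Mamba tiny model, whose loading is omitted). Slice count, resizing, channel replication, pooling, and head follow the description above; the uniform slice-selection scheme is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceSequenceClassifier(nn.Module):
    """50 axial slices -> per-slice 384-D embeddings -> mean pool -> 128-D -> 3 logits."""
    def __init__(self, backbone: nn.Module, emb_dim: int = 384, n_slices: int = 50):
        super().__init__()
        self.backbone, self.n_slices = backbone, n_slices
        self.head = nn.Sequential(nn.Linear(emb_dim, 128), nn.Linear(128, 3))

    def forward(self, vol: torch.Tensor) -> torch.Tensor:        # vol: (B, 1, 128, 128, 128)
        B = vol.shape[0]
        idx = torch.linspace(0, vol.shape[2] - 1, self.n_slices).long()
        x = vol[:, :, idx]                                       # (B, 1, 50, 128, 128) ordered slices
        x = x.permute(0, 2, 1, 3, 4).reshape(B * self.n_slices, 1, 128, 128)
        x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
        x = x.repeat(1, 3, 1, 1)                                 # replicate grayscale to 3 channels
        emb = self.backbone(x).view(B, self.n_slices, -1)        # (B, 50, 384)
        return self.head(emb.mean(dim=1))                        # mean-pool across slices
```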
Two-Dimensional Mamba Hybrid with CNN—Convolutional networks capture local spatial patterns efficiently, while Mamba—a selective state-space model (SSM)—models long-range dependencies in linear time, offering a lighter alternative to self-attention. As shown in Figure 3b, we combined both: each preprocessed T1-contrast volume (1 × 128 × 128 × 128) was first subjected to mean pooling along the z-axis to reduce through-plane redundancy, then passed through three Conv3D–BatchNorm–ReLU blocks interleaved with MaxPool3D (stride 2) to extract and compress volumetric features. The resulting feature map was upsampled and rearranged into an ordered slice sequence (50 slices), which was fed to a Vision-Mamba tiny backbone, producing a 384-dimensional embedding per slice. Sequence embeddings were mean-pooled to a volume-level descriptor, projected to 128 dimensions, and classified with a linear–SoftMax head (128 → 3). This design leverages CNNs for fine local structure while using SSM blocks to aggregate global context across slices with favorable compute and parameter efficiency.
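The convolutional front-end of this hybrid can be sketched as below; the z-axis pooling factor and channel widths are illustrative assumptions, and the upsampling/re-slicing step that hands the compressed feature map to the Mamba backbone is omitted.

```python
import torch
import torch.nn as nn

class HybridCNNFrontEnd(nn.Module):
    """z-axis mean pooling, then three Conv3D-BN-ReLU blocks interleaved with MaxPool3D (stride 2)."""
    def __init__(self):
        super().__init__()
        def block(ci: int, co: int) -> nn.Sequential:
            return nn.Sequential(nn.Conv3d(ci, co, 3, padding=1), nn.BatchNorm3d(co),
                                 nn.ReLU(), nn.MaxPool3d(2))
        self.blocks = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))

    def forward(self, vol: torch.Tensor) -> torch.Tensor:             # (B, 1, 128, 128, 128)
        # Reduce through-plane redundancy by averaging pairs of z-slices (factor 2 is illustrative).
        vol = vol.view(vol.shape[0], 1, 64, 2, 128, 128).mean(dim=3)  # (B, 1, 64, 128, 128)
        return self.blocks(vol)                                       # (B, 64, 8, 16, 16) compressed features
```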

3. Results

Analyses were anchored to two clinically defined follow-ups, evaluated as independent cohorts under identical procedures: (1) the first follow-up after radiotherapy (typically ~3–4 weeks post-RT) and (2) the second follow-up after radiotherapy (typically ~2–3 months post-RT). Evaluating the two follow-up scans independently allowed us to characterize model performance at each timepoint. Several clear patterns emerged. Performance was broadly similar across architectures, but class separability was poorer at the first follow-up than at the second. Complex models tended to achieve better discrimination at the expense of significantly longer computation time, while lightweight models were computationally efficient but less robust. Hybrid architectures achieved the best trade-off between accuracy and efficiency. These patterns form the context for the detailed results reported below.
All experiments were implemented in Python 3.9.23 using PyTorch 2.5.1 (with CUDA 12.1) and MONAI 1.4.0, and executed on a workstation with an NVIDIA RTX A6000 (48 GB) GPU and 128 GB RAM. Unless otherwise noted, models were trained with the Adam optimizer (learning rate = 1 × 10−4), cross-entropy loss, 10 epochs (determined empirically), and random seeds {21, 33, 42}. Batch sizes of one and eight were used for most models, while computationally intensive architectures (Mamba, Swin Transformer) required smaller batch sizes (one and six) due to GPU memory limits on 3D volumes. The 2D-Mamba+CNN model was further evaluated under batch sizes {1, 2, 4, 8} to assess robustness. Patient-level stratified five-fold cross-validation was used throughout. To prevent information leakage, all preprocessing, augmentation, and synthetic oversampling were fit exclusively on the training split within each fold. Class imbalance was mitigated using latent-space SMOTE (256-D embedding, k = 3) applied to autoencoder features and reconstructed into 3D volumes, with caps on synthetic samples to prevent oversampling bias. Mild augmentations (small affine transforms, Gaussian noise) were applied only to the training data. We report accuracy, macro-averaged F1 score, and macro-averaged AUC to reflect overall performance and robustness under class imbalance. In addition, we measured computational efficiency: FLOPs, parameter count, average batch inference time, and total training runtime.
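As a reference for the leakage-aware protocol, the sketch below shows a patient-level stratified five-fold split with scikit-learn's StratifiedGroupKFold (grouping by patient ID keeps every scan of a patient in one fold); the arrays are illustrative placeholders, and any oversampling or augmentation statistics would be fit on the training indices only.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Illustrative placeholders: one entry per scan, with labels and patient IDs.
X = np.array([f"vol_{i:03d}.nii.gz" for i in range(180)])
y = np.random.randint(0, 3, size=180)          # 3-class consolidated labels
patient_ids = np.arange(180)                   # repeated IDs if a patient has several scans

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(cv.split(X, y, groups=patient_ids)):
    # Fit SMOTE/augmentation on X[tr] only; X[va] stays untouched to prevent leakage.
    print(f"fold {fold}: {len(tr)} train / {len(va)} val scans")
```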
Classification performance—Table 2 summarizes the predictive performance across both follow-up cohorts. Accuracy values were relatively stable across timepoints (≈0.70–0.74), but discrimination improved at the second follow-up, reflected in higher F1 and AUC values for several models. The 2D-Mamba+CNN model demonstrated the most consistent trade-off across follow-ups, achieving 74.5% accuracy and F1 = 0.44 at the first follow-up and improving to 74.1% accuracy and F1 = 0.53 at the second follow-up. This suggests enhanced class separability later in the care pathway. The transformer-based models (3D-ViT, 2D-ViT+LSTM, Swin Transformer) yielded competitive accuracy and AUC but exhibited higher variance and unstable F1 scores. The lightweight CNNs remained efficient and stable, though their discrimination performance lagged behind. Notably, the highest AUC at the first follow-up was obtained with the 3D-ViT model (AUC = 0.57), whereas at the second follow-up, the 2D-Mamba+CNN model achieved the strongest AUC (0.66). Full results across all batch sizes are reported in Supplementary Table S1. Representative predictions from 2D-Mamba+CNN are illustrated in Figure 4.
Table 2. Summary of classification performance (accuracy, F1-score, and AUC) on first and second follow-up MRIs using a consistent batch size (Batch = 1). Complete results across all batch sizes are provided in Supplementary Table S1.
Computational Complexity—We quantify computational efficiency using FLOPs, trainable parameters, average batch inference time, and total training runtime (Table 3). All values are reported as mean ± SD across seeds {21, 33, 42} under the same hardware configuration. The transformer-based models (2D-ViT+LSTM, 3D-ViT, Swin Transformer) were computationally demanding (≥230 GFLOPs) with runtimes exceeding 1000 min, which challenges clinical deployment despite competitive accuracy/AUC. In contrast, the lightweight CNNs were highly efficient (<13 GFLOPs; <0.1M parameters), with stable runtimes, but exhibited lower discrimination. Full results across all batch sizes are reported in Supplementary Table S2.
Table 3. Summary of computational complexity and efficiency metrics (FLOPs, number of parameters, batch time, and runtime) across models using a consistent batch size (batch = 1). Detailed results for all batch sizes are provided in Supplementary Table S2.
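Parameter counts and average batch inference times of the kind reported in Table 3 can be measured with a few lines of PyTorch, as in the generic sketch below (FLOPs counting depends on third-party profilers and is omitted); the warm-up and run counts are arbitrary choices, not the study's settings.

```python
import time
import torch

def count_trainable_params(model: torch.nn.Module) -> int:
    """Trainable parameter count."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def avg_batch_inference_time(model: torch.nn.Module, x: torch.Tensor,
                             n_warmup: int = 5, n_runs: int = 20) -> float:
    """Average forward-pass wall time in seconds; cuda.synchronize avoids
    under-timing asynchronous GPU kernel launches."""
    model.eval()
    for _ in range(n_warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n_runs
```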
The 2D-Mamba+CNN model provided the most favorable efficiency–performance balance: a moderate parameter count (~24M) coupled with very low compute (<1 GFLOP) and reasonable runtime, while maintaining the most consistent accuracy–F1 trade-off across both follow-ups. These characteristics suggest a more practical path toward resource-constrained clinical environments than transformer variants, without the performance drops observed for ultra-light CNNs.
To conclude, the empirical evidence demonstrates that the hybrid of convolutional features and linear-time sequence modeling with selective state spaces (Mamba+CNN) consistently yields the best trade-off between predictive performance and computational cost. At both follow-up MRIs, Mamba+CNN delivered the most stable accuracy and a markedly improved F1 score at the second follow-up, indicating that it captures both spatial and slice-sequence dependencies. While the transformer-based models showed strong accuracy and AUC, their computational requirements limit their use in real clinical settings. Conversely, the efficient CNNs were computationally inexpensive, but their classification quality dropped significantly. Although Mamba+CNN achieved the best performance across most metrics and performed best with smaller batch sizes, it did not obtain the highest AUC in every setting: at the first follow-up, the highest AUC was achieved by 3D-ViT with a batch size of one, while at the second follow-up, 2D-Mamba+CNN with a batch size of four achieved the highest AUC. The Mamba+CNN model’s predicted results for each class are shown in Figure 4.
Figure 4. Representative examples of model predictions on first and second follow-up T1C MRI for each of the three classes.

4. Discussion

In general, the second follow-up exhibited greater separation between progression, pseudoprogression, and stable disease, consistent with the expected clinical shift from treatment-related changes on early imaging to more definitive evidence of recurrence at later timepoints. Although transformer-based models provide strong discrimination, they are computationally intensive, which restricts their use in daily clinical practice. The lightweight CNNs were highly efficient but offered lower robustness, whereas the 2D-Mamba+CNN model provided the strongest balance between performance and computational cost, making it a more viable candidate for clinical integration.
Several limitations should be noted. This study focused on two standardized post-treatment MRI timepoints that were consistently available across patients. The first follow-up, obtained approximately 3 weeks after radiotherapy (RT), reflects the early post-treatment period when inflammatory and treatment-related effects are most prominent. The second follow-up, performed 2–3 months after RT, represents a later stage in which evolving tumor biology or emerging pseudoprogression becomes more apparent. Although additional MRIs existed for some patients, these timepoints were not uniformly available or aligned across the cohort. Therefore, we selected these two clinically meaningful and consistently acquired follow-ups to ensure comparability and avoid substantial loss of sample size or selection bias. The dataset is still quite small and imbalanced, and multi-center validation is required to verify its generalizability. In addition, this study only assessed T1C MRI. While this sequence is widely accessible, it cannot comprehensively describe treatment-induced alterations on imaging, and the addition of multi-sequence MRI, such as FLAIR, T1, and T2, may lead to an even better separation of progression from pseudoprogression.

5. Conclusions

We presented the first timepoint-specific benchmarking of deep learning architectures for differentiating true progression from pseudoprogression in glioblastoma using follow-up MRI. Evaluating eleven model families within a unified, quality-controlled pipeline, we found that overall performance remains moderate but improves at later follow-ups, consistent with richer clinical separability between treatment effects and true recurrence over time. These results establish a timepoint-aware benchmark for follow-up MRI analysis in glioblastoma and underscore the importance of standardized, leakage-aware training and evaluation protocols for this challenging task.
At the architectural level, our findings extend prior work on hybrid models for medical imaging by systematically comparing CNNs, transformers, and state-space-based approaches in longitudinal glioblastoma follow-ups. A 2D-Mamba+CNN model provided the most reliable trade-off between predictive accuracy and computational efficiency, whereas transformer-based models (e.g., ViT, Swin) achieved competitive accuracy and AUC at the cost of higher computational burden and more variable F1 scores. In contrast, lightweight CNNs were highly efficient in terms of FLOPs and parameter counts but exhibited lower discrimination, reflecting the difficulty of handling class imbalance and complex spatiotemporal patterns in this setting. Broader validation on multi-center, multi-sequence cohorts, incorporation of clinical and molecular covariates, and exploration of self-supervised and model-compression strategies will be critical next steps toward practical deployment of automated follow-up MRI analysis in neuro-oncology.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers18010036/s1, Figure S1: CNN architecture; Figure S2: LSTM architecture; Figure S3: 3D-ViT architecture; Figure S4: 3D-ResNet architecture; Figure S5: CNN+LSTM architecture; Figure S6: CNN+SE Attention architecture; Figure S7: 2D ViT+LSTM architecture; Table S1: Summary of classification performance on first and second follow-up MRIs; Table S2: Computational complexity and runtime metrics for all models.

Author Contributions

Methodology, formal analysis, original draft preparation, W.G. Supervision, review, and editing, G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the MRI data were fully anonymized and obtained from a publicly available dataset on The Cancer Imaging Archive, and no identifiable human information was accessed.

Data Availability Statement

The MRI data used in this study are publicly available from The Cancer Imaging Archive (TCIA) under the Burdenko Glioblastoma Progression Dataset (Burdenko-GBM-Progression). The dataset can be accessed at https://www.cancerimagingarchive.net/collection/burdenko-gbm-progression/ (accessed on 2 December 2025). Archived version: https://web.archive.org/web/20240101000000/https://www.cancerimagingarchive.net/collection/burdenko-gbm-progression/.

Acknowledgments

The authors thank Pierre Giglio, Director of Neuro-Oncology at The Ohio State University Wexner Medical Center, for his advice on MRI labeling and for reviewing the labeling strategy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, X.; Wong, K.K.; Young, G.S.; Guo, L.; Wong, S.T. Support vector machine multiparametric MRI identification of pseudoprogression from tumor recurrence in patients with resected glioblastoma. J. Magn. Reson. Imaging 2011, 33, 296–305. [Google Scholar] [CrossRef] [PubMed]
  2. Sun, Y.-Z.; Yan, L.-F.; Han, Y.; Nan, H.-Y.; Xiao, G.; Tian, Q.; Pu, W.-H.; Li, Z.-Y.; Wei, X.-C.; Wang, W.; et al. Differentiation of Pseudoprogression from True Progression in Glioblastoma Patients after Standard Treatment: A Machine Learning Strategy Combined with Radiomics Features from T1-weighted Contrast-enhanced Imaging. BMC Med. Imaging 2021, 21, 17. [Google Scholar] [CrossRef] [PubMed]
  3. Ari, A.P.; Akkurt, B.H.; Musigmann, M.; Mammadov, O.; Blömer, D.A.; Kasap, D.N.G.; Henssen, D.J.H.A.; Nacul, N.G.; Sartoretti, E.; Sartoretti, T.; et al. Pseudoprogression prediction in high grade primary CNS tumors by use of radiomics. Sci. Rep. 2022, 12, 5915. [Google Scholar] [CrossRef] [PubMed]
  4. Warner, E.; Lee, J.; Krishnan, S.; Wang, N.; Mohammed, S.; Srinivasan, A.; Bapuraj, J.; Rao, A. Low-parameter supervised learning models can discriminate pseudoprogression and true progression in non-perfusion MRI. In Proceedings of the IEEE EMBC 2023, Sydney, Australia, 24–27 July 2023; pp. 1–4. [Google Scholar] [CrossRef]
  5. Jang, B.-S.; Jeon, S.H.; Kim, I.H.; Kim, I.A. Prediction of Pseudoprogression versus Progression using Machine Learning Algorithm in Glioblastoma. Sci. Rep. 2018, 8, 12516. [Google Scholar] [CrossRef]
  6. Li, M.; Tang, H.; Chan, M.D.; Zhou, X.; Qian, X. DC-AL GAN: Pseudoprogression and True Tumor Progression of Glioblastoma Multiform Image Classification Based on DCGAN and AlexNet. arXiv 2019, arXiv:1902.06085. [Google Scholar] [CrossRef]
  7. Lee, J.; Wang, N.; Turk, S.; Mohammed, S.; Lobo, R.; Kim, J.; Liao, E.; Camelo-Piragua, S.; Kim, M.; Junck, L.; et al. Discriminating pseudoprogression and true progression in diffuse glioma using multi-parametric MRI and data through deep learning. Sci. Rep. 2020, 10, 20331. [Google Scholar] [CrossRef]
  8. Moassefi, M.; Faghani, S.; Conte, G.M.; Kowalchuk, R.O.; Vahdati, S.; Crompton, D.J.; Perez-Vega, C.; Domingo Cabreja, R.A.; Vora, S.A.; Quiñones-Hinojosa, A.; et al. A deep learning model for discriminating true progression from pseudoprogression in glioblastoma patients. J. Neurooncol. 2022, 159, 447–455. [Google Scholar] [CrossRef]
  9. Turk, S.; Wang, N.C.; Kitis, O.; Mohammed, S.; Ma, T.; Lobo, R.; Kim, J.; Camelo-Piragua, S.; Johnson, T.D.; Kim, M.M.; et al. Comparative study of radiologists vs machine learning in differentiating biopsy-proven pseudoprogression and true progression in diffuse gliomas. Neurosci. Inform. 2022, 2, 100088. [Google Scholar] [CrossRef]
  10. Li, M.; Ren, X.; Dong, G.; Wang, J.; Jiang, H.; Yang, C.; Zhao, X.; Zhu, Q.; Cui, Y.; Yu, K.; et al. Distinguishing Pseudoprogression From True Early Progression in Isocitrate Dehydrogenase Wild-Type Glioblastoma by Interrogating Clinical, Radiological, and Molecular Features. Front. Oncol. 2021, 11, 627325. [Google Scholar] [CrossRef]
  11. McKenney, A.S.; Weg, E.; Bale, T.A.; Wild, A.T.; Um, H.; Fox, M.J.; Lin, A.; Yang, J.T.; Yao, P.; Birger, M.L.; et al. Radiomic Analysis to Predict Histopathologically Confirmed Pseudoprogression in Glioblastoma Patients. Adv. Radiat. Oncol. 2023, 8, 100916. [Google Scholar] [CrossRef]
  12. Yadav, V.K.; Mohan, S.; Agarwal, S.; de Godoy, L.L.; Rajan, A.; Nasrallah, M.P.; Bagley, S.J.; Brem, S.; Loevner, L.A.; Poptani, H.; et al. Distinction of pseudoprogression from true progression in glioblastomas using machine learning based on multiparametric magnetic resonance imaging and O6-methylguanine-methyltransferase promoter methylation status. Neurooncol. Adv. 2024, 6, vdae159. [Google Scholar] [CrossRef]
  13. Lv, Y.; Liu, J.; Tian, X.; Yang, P.; Pan, Y. CFINet: Cross-modality MRI feature interaction network for pseudoprogression prediction. J. Comput. Biol. 2025, 32, 212–224. [Google Scholar] [CrossRef] [PubMed]
  14. Gomaa, A.; Huang, Y.; Stephan, P.; Breininger, K.; Frey, B.; Dörfler, A.; Schnell, O.; Delev, D.; Coras, R.; Schmitter, C.; et al. A Self-supervised Multimodal Deep Learning Approach to Differentiate Post-radiotherapy Progression from Pseudoprogression in Glioblastoma. arXiv 2024, arXiv:2502.03999. [Google Scholar] [CrossRef]
  15. Liu, R.; Hall, L.O.; Bowyer, K.W.; Goldgof, D.; Gatenby, R.; Ben Ahmed, K. Synthetic minority image over-sampling technique: How to improve AUC for glioblastoma patient survival prediction. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 1357–1362. [Google Scholar] [CrossRef]
  16. Kalata, I.K.; Khan, R.; Nakarmi, U. Learning From Oversampling: A Systematic Exploitation of Oversampling to Address Data Scarcity Issues in Deep Learning-Based Magnetic Resonance Image Reconstruction. IEEE Access 2024, 12, 10591992. [Google Scholar] [CrossRef]
  17. Demircioğlu, A. Applying oversampling before cross-validation will lead to high bias in radiomics. Sci. Rep. 2024, 14, 11563. [Google Scholar] [CrossRef]
  18. Wang, J.; Awang, N. A Novel Synthetic Minority Oversampling Technique for Multiclass Imbalance Problems. IEEE Access 2025, 13, 10829925. [Google Scholar] [CrossRef]
  19. Gong, Y.; Wu, Q.; Zhou, M.; Chen, C. A diversity and reliability-enhanced synthetic minority oversampling technique for multi-label learning. Inf. Sci. 2025, 690, 121579. [Google Scholar] [CrossRef]
  20. Ahsan, M.M.; Raman, S.; Liu, Y.; Siddique, Z. Hybrid oversampling technique for imbalanced pattern recognition: Enhancing performance with Borderline Synthetic Minority oversampling and Generative Adversarial Networks. Mach. Learn. Appl. 2025, 13, 100637. [Google Scholar] [CrossRef]
  21. Taal, W.; Brandsma, D.; de Bruin, H.G.; Bromberg, J.E.; Swaak-Kragten, A.T.; Sillevis Smitt, P.A.E.; van den Bent, M.J. Incidence of Early Pseudo-Progression in a Cohort of Malignant Glioma Patients Treated with Chemoirradiation with Temozolomide. Cancer 2008, 113, 405–410. [Google Scholar] [CrossRef]
  22. Young, J.S.; Al-Adli, N.; Scotford, K.; Cha, S.; Berger, M.S. Pseudoprogression versus true progression in glioblastoma: What neurosurgeons need to know. J. Neurosurg. 2023, 139, 748–759. [Google Scholar] [CrossRef]
  23. Blakstad, H.; Mendoza Mireles, E.E.; Heggebø, L.C.; Magelssen, H.; Sprauten, M.; Johannesen, T.B.; Vik-Mo, E.O.; Leske, H.; Niehusmann, P.; Skogen, K.; et al. Incidence and outcome of pseudoprogression after radiation in glioblastoma patients: A cohort study. Neurooncol. Pract. 2024, 11, 36–45. [Google Scholar] [CrossRef] [PubMed]
  24. Zolotova, S.V.; Golanov, A.V.; Pronin, I.N.; Dalechina, A.V.; Nikolaeva, A.A.; Belyashova, A.S.; Usachev, D.Y.; Kondrateva, E.A.; Druzhinina, P.V.; Shirokikh, B.N.; et al. Burdenko-GBM-Progression Dataset (Version 1). The Cancer Imaging Archive, 2023. Available online: https://www.cancerimagingarchive.net/collection/burdenko-gbm-progression/ (accessed on 2 December 2025).
  25. Qian, X.; Tan, H.; Zhang, J.; Li, Y.; Zhao, W.; Chan, M.D.; Zhou, X. Stratification of pseudoprogression and true progression of glioblastoma multiform based on longitudinal diffusion tensor imaging without segmentation. Med. Phys. 2016, 43, 5889–5902. [Google Scholar] [CrossRef]
  26. Zhang, J.; Yu, H.; Qian, X.; Liu, K.; Tan, H.; Yang, T.; Wang, M.; Li, K.C.; Chan, M.D.; Debinski, W.; et al. Pseudoprogression identification of glioblastoma with dictionary learning. Comput. Biol. Med. 2016, 73, 94–101. [Google Scholar] [CrossRef]
  27. Booth, T.C.; Larkin, T.J.; Yuan, Y.; Kettunen, M.I.; Dawson, S.N.; Scoffings, D.; Canuto, H.C.; Vowler, S.L.; Kirschenlohr, H.; Hobson, M.P.; et al. Analysis of heterogeneity in T2-weighted MR images can differentiate pseudoprogression from progression in glioblastoma. PLoS ONE 2017, 12, e0176528. [Google Scholar] [CrossRef]
  28. Ismail, M.; Hill, V.; Statsevych, V.; Huang, R.; Prasanna, P.; Correa, R.; Singh, G.; Bera, K.; Beig, N.; Thawani, R.; et al. Shape Features of the Lesion Habitat to Differentiate Brain Tumor Progression from Pseudoprogression on Routine Multiparametric MRI: A Multisite Study. AJNR Am. J. Neuroradiol. 2018, 39, 2187–2193. [Google Scholar] [CrossRef]
  29. Kim, J.Y.; Park, J.E.; Jo, Y.; Shim, W.H.; Nam, S.J.; Kim, J.H.; Yoo, R.-E.; Choi, S.H.; Kim, H.S. Incorporating diffusion- and perfusion-weighted MRI into a radiomics model improves diagnostic performance for pseudoprogression in glioblastoma patients. Neuro-Oncology 2019, 21, 404–414. [Google Scholar] [CrossRef]
  30. Elshafeey, N.; Kotrotsou, A.; Hassan, A.; Elshafei, N.; Hassan, I.; Ahmed, S.; Abrol, S.; Agarwal, A.; El Salek, K.; Bergamaschi, S.; et al. Multicenter study demonstrates radiomic features derived from magnetic resonance perfusion images identify pseudoprogression in glioblastoma. Nat. Commun. 2019, 10, 3170. [Google Scholar] [CrossRef]
  31. Bani-Sadr, A.; Eker, O.F.; Berner, L.-P.; Ameli, R.; Hermier, M.; Barritault, M.; Meyronet, D.; Guyotat, J.; Jouanneau, E.; Honnorat, J.; et al. Conventional MRI radiomics in patients with suspected early- or pseudo-progression. Neurooncol. Adv. 2019, 1, vdz019. [Google Scholar] [CrossRef]
  32. Jang, B.-S.; Jeon, S.H.; Park, A.J.; Kim, I.H.; Lim, D.H.; Park, S.H.; Lee, J.H.; Chang, J.H.; Cho, K.H.; Kim, J.H.; et al. ML to predict pseudoprogression vs progression: Multi-institutional study. Cancers 2020, 12, 2706. [Google Scholar] [CrossRef]
  33. Liu, X.; Zhou, X.; Qian, X. Transparency-guided ensemble convolutional neural network for the stratification between pseudoprogression and true progression of glioblastoma multiform in MRI. J. Vis. Commun. Image Represent. 2020, 72, 102880. [Google Scholar] [CrossRef]
  34. Akbari, H.; Rathore, S.; Bakas, S.; Nasrallah, P.; Shukla, G.; Mamourian, E.; Rozycki, M.; Bagley, S.; Rudie, J.; Flanders, A.; et al. Histopathology-validated machine learning radiographic biomarker for noninvasive discrimination between true progression and pseudo-progression in glioblastoma. Cancer 2020, 126, 2625–2636. [Google Scholar] [CrossRef] [PubMed]
  35. Lohmann, P.; Elahmadawy, M.A.; Gutsche, R.; Werner, J.-M.; Bauer, E.K.; Ceccon, G.; Kocher, M.; Lerche, C.W.; Rapp, M.; Fink, G.R.; et al. FET-PET radiomics for differentiating Pseudoprogression in Glioma Patients Post-Chemoradiation. Cancers 2020, 12, 3835. [Google Scholar] [CrossRef]
  36. Kebir, S.; Schmidt, T.; Weber, M.; Lazaridis, L.; Galldiks, N.; Langen, K.-J.; Kleinschnitz, C.; Hattingen, E.; Herrlinger, U.; Lohmann, P.; et al. A Preliminary Study on Machine learning-Based Evaluation of static and Dynamic FET-PET for the Detection of Pseudoprogression in Patients with IDH-wildtype Glioblastoma. Cancers 2020, 12, 3080. [Google Scholar] [CrossRef]
  37. Baine, M.; Burr, J.; Du, Q.; Zhang, C.; Liang, X.; Krajewski, L.; Zima, L.; Rux, G.; Zhang, C.; Zheng, D. The Potential Use of Radiomics with Pre-Radiation Therapy MR Imaging in Predicting Risk of Pseudoprogression in Glioblastoma Patients. J. Imaging 2021, 7, 17. [Google Scholar] [CrossRef] [PubMed]
  38. Wang, Z.; Khan, R.; Abbasian, P.; Ryner, L.; Lambert, P.; Pitz, M.; Ashraf, A. Interpretable Deep Learning Model for Distinguishing Tumor Pseudoprogression from True Progression Using MRI Imaging of Glioblastoma Patients. In Proceedings of the Medical Imaging 2025: Computer-Aided Diagnosis, 134073K, San Diego, CA, USA, 17–20 February 2025; Volume 13407. [Google Scholar] [CrossRef]
