A Deep Learning Model for Preoperative Differentiation of Glioblastoma, Brain Metastasis, and Primary Central Nervous System Lymphoma: An External Validation Study

(1) Background: Neuroimaging differentiation of glioblastoma, primary central nervous system lymphoma (PCNSL) and solitary brain metastasis (BM) represents a diagnostic and therapeutic challenge in neurosurgical practice, expanding the burden of care and exposing patients to additional risks related to further invasive procedures and treatment delays. In addition, atypical cases and overlapping features have not been entirely addressed by modern diagnostic research. The aim of this study was to validate a previously designed and internally validated ResNet101 deep learning model to differentiate glioblastomas, PCNSLs and BMs. (2) Methods: We enrolled 126 patients (glioblastoma: n = 64; PCNSL: n = 27; BM: n = 35) with preoperative T1Gd-MRI scans and histopathological confirmation. Each lesion was segmented, and all regions of interest were exported in a DICOM dataset. A pre-trained ResNet101 deep neural network model, implemented in a previous work on 121 patients, was externally validated on the current cohort to differentiate glioblastomas, PCNSLs and BMs on T1Gd-MRI scans. (3) Results: The model achieved good classification performance in distinguishing PCNSLs (AUC: 0.73; 95% CI: 0.62–0.85) and glioblastomas (AUC: 0.78; 95% CI: 0.71–0.87), and moderate-to-low ability in differentiating BMs (AUC: 0.63; 95% CI: 0.52–0.76). The performance of expert neuro-radiologists on conventional plus advanced MR imaging, assessed by retrospectively reviewing the diagnostic reports of the selected cohort of patients, was superior in accuracy for BMs (89.69%) and not inferior for PCNSLs (82.90%) and glioblastomas (84.09%). (4) Conclusions: We investigated whether the previously published deep learning model was generalizable to an external population recruited at a different institution; this validation confirmed the consistency of the model and laid the groundwork for future clinical applications in brain tumour classification.
This artificial intelligence-based model might represent a valuable educational resource and, if widely replicated on prospective data, help physicians differentiate glioblastomas, PCNSLs and solitary BMs, especially in settings with limited resources.


Introduction
Preoperative classification of brain tumours represents a critical aspect of patient management. Brain metastases (BMs), glioblastoma and primary central nervous system lymphomas (PCNSLs) are among the most frequent intracranial neoplasms in adults (17%, 14.3% and 1.9%, respectively); hence, a correct diagnosis is a crucial point in the therapeutic path of a large number of patients worldwide [1][2][3].
In spite of the increased efficiency and popularity of MRI and the availability of advanced neuroimaging techniques that may assist in differentiating glioblastomas, BMs and PCNSLs, cases showing atypical features may prove challenging even for expert clinicians who spend a large proportion of their work time identifying, segmenting and classifying these lesions [4,5].
As far as the T1-weighted gadolinium-enhanced (T1Gd) images considered in this study are concerned, glioblastomas appear as iso-hypointense masses with necrotic-cystic areas and irregular contrast-enhanced margins similar to solitary BMs; however, atypical glioblastomas may show minimal or absent central necrosis.
PCNSLs, on the contrary, usually appear on T1Gd images as iso-hypointense masses with homogeneous enhancement throughout the lesion; in atypical presentations, central necrosis may be present and mimic glioblastomas [6]. Moreover, the preoperative use of steroids in patients with PCNSLs may entail false-negative pathological results, requiring additional invasive manoeuvres, with potential harm and costs [7], to obtain the correct diagnosis.
The aim of this study was to develop a fast and reliable system for brain tumour classification in an experimental retrospective clinical scenario. In a previous investigation [13], we designed and internally validated a DNN model, achieving excellent diagnostic performance. The purpose of this study was the external validation of the model's accuracy in differentiating glioblastomas, PCNSLs and BMs on T1Gd MRI scans, and a discussion of its potential role in improving diagnostic and interventional workflows.

Study Definition
Ethical approval was waived by the local Ethics Committees of the two institutions involved, in view of the retrospective nature of the study and because all performed procedures were part of routine care. Informed consent was obtained from all participants included in the study. All procedures performed in studies involving human participants were in accordance with the Declaration of Helsinki.
An internal committee among authors (L.T., G.F., G.A.B., G.C., M.L.) was formed, and a consensus was achieved on the current investigation's proper design and reporting guidelines. An extensive review of "Enhancing the quality and transparency of health research" (EQUATOR) [14] network "https://www.equator-network.org" (accessed on 4 January 2022) contents was performed, and the "Standard for reporting of diagnostic accuracy study-Artificial Intelligence" (STARD-AI) [15] guidelines were selected and followed in the study protocol definition. The STARD-AI [15] guidelines were developed to report AI diagnostic test accuracy studies as an evolution of the previous STARD 2015 version [16], with the addition of a specific focus on designing and reporting evidence provided through AI-centred interventions. Adherence to STARD-AI recommendations was reviewed by the senior authors (G.C. and M.L.) throughout the investigation and during the final review.

Patient Selection
The medical records and preoperative imaging of patients who underwent surgical tumour resection or biopsy at "Fondazione IRCCS Cà Granda Ospedale Maggiore Policlinico, Milan, Italy" (named Training Site or TrS) between June 2020 and April 2021 and at "Ospedale San Gerardo di Monza, Monza, Italy" (named Testing Site or TeS) between January 2018 and November 2021 were retrospectively collected. Patient data were included in the analysis if preoperative T1Gd MR images were available and histological analysis confirmed the diagnosis of glioblastoma, PCNSL or solitary BMs.
Patients were excluded if: (1) preoperative T1Gd MR images were absent or inadequate in quality, according to the senior neuroradiologists; (2) they had previously received intracranial intervention (surgical intervention, gamma knife surgery or radiation therapy); (3) multiple enhancing lesions were detected on preoperative MRI; (4) in glioblastoma cases, histopathological exams revealed IDH1 or IDH2 mutations; hence, only IDH1 and IDH2 wild-type tumours were further considered in the investigation.
One-hundred twenty-one patients operated on at the TrS were selected to provide image data for the training dataset of our DNN model, as reported in a previous study [13].
A total of 126 patients met the inclusion criteria at the TeS and were selected for external validation of the aforementioned model.

MR Acquisition and Image Pre-Processing
The MR image scanning parameters at the TrS are reported elsewhere [13]. Concerning the MRI acquisition protocol at the TeS, all brain MRI studies were performed with a 1.5 T system (Philips Ingenia 1.5T CX), including axial T2-weighted imaging, fluid-attenuated inversion recovery (FLAIR) imaging, diffusion-weighted imaging (DWI) (b-value of 1000 s/mm² and a single b = 0 acquisition), susceptibility-weighted imaging (SWI), and volumetric contrast-enhanced axial and sagittal T1Gd (Gadovist 1 mmol/mL; 0.1 mmol/kg body weight) imaging; ADC maps were calculated from the isotropic DWI.
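For readers unfamiliar with how ADC maps are derived from DWI, the standard mono-exponential estimate from the b = 0 and b = 1000 s/mm² acquisitions can be sketched as follows. This is a generic illustration with synthetic values, not the scanner vendor's actual reconstruction:

```python
import numpy as np

def compute_adc(s0, s_b, b_value=1000.0, eps=1e-6):
    """Voxel-wise mono-exponential ADC estimate (mm^2/s).

    s0  : signal volume at b = 0
    s_b : signal volume at the given b-value (here 1000 s/mm^2)
    """
    s0 = np.asarray(s0, dtype=np.float64)
    s_b = np.asarray(s_b, dtype=np.float64)
    ratio = np.clip(s_b, eps, None) / np.clip(s0, eps, None)
    adc = -np.log(ratio) / b_value
    return np.clip(adc, 0.0, None)  # negative estimates are noise artefacts

# toy 2x2 slice whose signal halves at b = 1000:
s0 = np.array([[1000.0, 800.0], [600.0, 1200.0]])
s_b = np.array([[500.0, 400.0], [300.0, 600.0]])
adc = compute_adc(s0, s_b)  # ln(2)/1000 ~ 6.93e-4 mm^2/s in every voxel
```

Restricted diffusion (e.g., in hypercellular PCNSL) lowers the ADC value, which is why these maps complement T1Gd in the multimodal work-up discussed later.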
All MR images in the digital imaging and communications in medicine (DICOM) format were input to the Horos DICOM Viewer version 3.3.5, "www.horosproject.org" (accessed on 4 January 2022), a free, open-source medical imaging viewer and analytic tool. The lesions' regions of interest (ROIs) were manually delineated on volumetric axial T1Gd scans. After segmentation and signal intensity normalization, all ROIs were then centred in a 224 × 224 pixels black box and exported in PNG file format (Figure 1).
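The normalization and centring step can be sketched as follows. This is a hypothetical NumPy re-implementation for illustration (min-max normalization assumed), not the actual Horos-based pipeline; the final PNG write (e.g., via Pillow) is omitted:

```python
import numpy as np

def center_roi(roi, box_size=224):
    """Min-max normalize a segmented ROI patch and centre it on a black box."""
    roi = np.asarray(roi, dtype=np.float64)
    rng = roi.max() - roi.min()
    norm = (roi - roi.min()) / rng if rng > 0 else np.zeros_like(roi)
    patch = np.round(norm * 255).astype(np.uint8)

    canvas = np.zeros((box_size, box_size), dtype=np.uint8)  # black background
    h, w = patch.shape
    top, left = (box_size - h) // 2, (box_size - w) // 2
    canvas[top:top + h, left:left + w] = patch
    return canvas

# a hypothetical 96 x 120 ROI patch with raw scanner intensities
canvas = center_roi(np.random.rand(96, 120) * 4000)
# canvas is 224 x 224, uint8, with the lesion centred and zeros elsewhere,
# ready to be written out as a PNG file
```

The fixed 224 × 224 canvas matches the input resolution expected by the ResNet family of models.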
Each ROI was used as input for all three channels expected by the ResNet model and was treated as an independent image to increase the input data, though a group of slices was available for each patient. The predicted diagnostic class for each patient was the most frequently voted among its entire ROI set. The reported performance metrics were computed considering the number of correctly predicted patients and not the whole ROI dataset.
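The patient-level majority vote over per-slice predictions described above can be sketched as follows (class names are illustrative):

```python
from collections import Counter

def patient_prediction(slice_preds):
    """Majority vote over the per-slice class predictions of one patient."""
    return Counter(slice_preds).most_common(1)[0][0]

def patient_level_accuracy(per_patient_slice_preds, true_labels):
    """Accuracy computed over patients, not over individual ROIs."""
    voted = [patient_prediction(p) for p in per_patient_slice_preds]
    correct = sum(v == t for v, t in zip(voted, true_labels))
    return correct / len(true_labels)

# toy example: three patients with several ROI slices each
preds = [["GBM", "GBM", "BM"], ["PCNSL"] * 4, ["BM", "GBM", "BM", "BM"]]
truth = ["GBM", "PCNSL", "BM"]
acc = patient_level_accuracy(preds, truth)  # -> 1.0
```

Note that `Counter.most_common` breaks exact ties by insertion order; a production pipeline would need an explicit tie-breaking rule.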

Performance Metrics
The classification performance of the DNN model was evaluated considering the following metrics: (1) area under the receiver operating characteristic curve (AUC-ROC). A complete explanation of the parameters mentioned above is beyond the scope of the current study; further comprehensive descriptions are available elsewhere [21].
A one-vs-rest (OVR) multiclass strategy was employed to extract performance metrics for each outcome class. Then, the average value and its 95% bootstrap confidence interval were computed for each performance metric on the hold-out test set.
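The OVR AUC computation with patient-level bootstrap confidence intervals can be sketched as follows. This is a generic NumPy re-implementation for illustration, assuming a matrix of per-class probabilities; it is not the study's actual code (which used scikit-learn):

```python
import numpy as np

def binary_auc(y_bin, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y_bin == 1]
    neg = scores[y_bin == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def ovr_auc_with_ci(y_true, prob, n_boot=1000, alpha=0.05, seed=0):
    """One-vs-rest AUC per class with a bootstrap CI resampled over patients."""
    rng = np.random.default_rng(seed)
    n, n_classes = prob.shape
    results = {}
    for c in range(n_classes):
        y_bin = (y_true == c).astype(int)          # class c vs the rest
        point = binary_auc(y_bin, prob[:, c])
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)            # resample patients with replacement
            yb = y_bin[idx]
            if yb.min() == yb.max():               # resample lost one of the classes
                continue
            boot.append(binary_auc(yb, prob[idx, c]))
        lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        results[c] = (point, lo, hi)
    return results

# synthetic, well-separated example: 3 classes x 3 patients each
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
p = np.array([[0.8, 0.1, 0.1]] * 3 + [[0.1, 0.8, 0.1]] * 3 + [[0.1, 0.1, 0.8]] * 3)
metrics = ovr_auc_with_ci(y, p, n_boot=200)  # metrics[c] = (AUC, CI low, CI high)
```

Resampling entire patients (rather than individual ROIs) keeps the bootstrap consistent with the patient-level metric definition above.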

Human "Gold Standard" Performance
The tumour radiological assessment was addressed by experienced neuroradiologists (P.R. and G.B.) with at least 10 years of clinical experience. Electronic radiological reports were retrospectively reviewed to collect the primary radiological diagnosis. Afterwards, a comparison with the histopathological charts was completed, and the diagnostic classes were checked for discrepancies between radiological and pathological characterization. An OVR multiclass method was employed to extract neuroradiologists' performance metrics for each outcome class.

Software and Hardware
All statistical analyses were performed in a Jupyter Notebook using Python v3.7.6 "https://www.python.org/" (accessed on 4 January 2022). The Python packages used for this study included: 'PyTorch v1.7' to develop and train the DNN model; 'NumPy' for Excel dataset handling; 'Scikit-learn' to compute performance metrics; and 'Seaborn' to plot the ROC-AUC curves. The workstation used to train the DNN model was equipped with an Intel Core i7-10700K processor and a Tesla K80 12 GB GPU.

Comparison of DNN Model and Neuroradiologists' Gold Standard Performance
The performance metrics achieved by expert neuroradiologists are provided in Table 3.


Performance Validation
In a previous study, we reported on a DNN model capable of efficiently and accurately differentiating glioblastomas, PCNSLs and BMs in an experimental "offline" environment [13]. Here, we externally validated the DNN model on "never seen" data gathered at an external academic site (TeS) with a comparable caseload, facility settings and technologies. The accuracy returned by our model was not inferior to senior neuroradiologists' performance in identifying PCNSLs and glioblastomas; accuracy in BM identification was moderate, and lower than human evaluation.
In light of our previous preliminary findings, the evidence of model robustness and generalizability achieved in the current study supports the thesis of our DNN model being "experimentally not inferior" to senior physicians in classifying brain tumours in an unbiased cohort, endorsing the development and deployment of such models in medical training and clinical practice if cleared by regulatory authorities.
As previously documented, differentiating dubious BMs from gliomas and PCNSLs is challenging per se. Despite substantial advancements in the last decade, no single MRI modality can differentiate PCNSLs, BMs and glioblastomas with absolute accuracy. The search for a single sequence that would best classify these tumours has remained largely academic, restricted to synthetic scenarios rather than simulating the decision workflow of clinical practice, where multimodality is preferred. Indeed, results from previous studies are contradictory [22,23], with several authors reporting the superiority of T2-weighted, FLAIR or T1Gd scans in brain tumour segmentation and classification [24][25][26]. A multimodal MRI approach has recently shown promising diagnostic performance in differentiating brain neoplasms in experimental settings; relevant findings were confirmed for dynamic susceptibility contrast (DSC) and apparent diffusion coefficient (ADC) maps combined with T1Gd-MRI scans. However, this multimodal approach has yet to acquire a standardized diagnostic role, owing to operator-dependent interpretation bias, high heterogeneity among brain tumour phenotypes and the additional hardware and set-up protocols required, which might curb its use in facilities with limited resources [27][28][29].
During the study design, the authors agreed to use T1Gd-MRI images only, relying on the greater worldwide availability of this sequence compared to diffusion and perfusion protocols, with the aim of extending the reproducibility of our workflow. In addition, T1Gd imaging offers superior delineation of tumour borders and precise representation of central necrosis, which are common features of glioblastomas, atypical PCNSLs and BMs [30], facilitating manual segmentation and limiting ROI-drawing biases. However, the inclusion of additional sequences might have allowed superior performance in the classification task.
Performance on BMs scored significantly lower than both the internal validation dataset and the neuroradiologists' performance metrics (accuracy: 77% vs. 81% vs. 89%, respectively [13]). This underperformance may be attributable to the great histological heterogeneity of this group of lesions and the consequent variability in radiological features. Additionally, a key distinguishing feature of BMs is abundant peritumoural oedema [31]; however, the peritumoural radiological environment was not included in the ROI segmentation of our dataset, which was limited to T1Gd lesion boundaries. This might have contributed to the lower performance of the DNN on BMs, together with the neuroradiologists' access to clinical history and additional imaging work-ups to which the DNN model was blinded. Indeed, while the model saw nothing beyond the T1Gd scans, the diagnostic process at the time of imaging work-up included additional characterization by means of total-body CT, positron emission tomography (PET) and advanced MRI scans in a proportion of cases. Because the retrospective evaluation of radiological reports was set in routine clinical practice, we could not assess whether these diagnostic exams, which were not involved in the current investigation, had a valuable impact on the putative radiological diagnosis. The comparative performance of the DNN and senior neuroradiologists should be evaluated accordingly, and conclusions should be drawn carefully.

Perspective for Clinical Application and Public Health Impact
From a public health perspective, diagnostic tools such as our validated DNN model represent a promising technology spreading worldwide within industry, academia and personal life settings. It is estimated that implementing AI algorithms in the USA might save USD 150 billion in healthcare costs by 2026 [32], with a net benefit even in lower-income countries, where AI experimentation is still under-practised. Implementation of AI protocols in healthcare is increasing in resource-poor countries of Asia and Africa alongside the wider availability of mobile phones, mobile health applications and cloud computing, which generate a sufficient mass of data to be redirected to the purposes of studies like our own.
Given this, we believe that AI models might assist physicians in low-income countries in tackling macro- and micro-scale healthcare disparities, and might reduce healthcare borders and inequalities between high- and low-income countries by optimizing diagnostic workflows, augmenting physician performance in settings where highly trained personnel are not routinely available, or favouring teleconsultations and patient referral to more experienced hospitals. The whole process, as anticipated in high-income countries, might improve healthcare quality and allow weighted cost reduction [33], as suggested by a recent survey conducted in Pakistan [34]. However, our belief about the contributions of AI to healthcare optimization in such settings is speculative, and the literature on AI use in resource-poor countries is still too scarce to draw accurate predictions.

Perspective in Medical Education
Beyond the previously discussed applications, the efficiency of computer vision (CV) has already been demonstrated in other clinical scenarios (e.g., skin cancer classification, diagnosis of retinal disease, detection of mammographic lesions, fracture detection and many other tasks) [35][36][37][38].
Recent advancements have been made in integrating CV, and ML in general, into medical education and skill evaluation. Oliveira et al. reported a deep learning model called PRIME that is able to evaluate the microsurgical ability of different neurosurgeons in vessel dissection and micro-suturing; it was designed with the aim of flattening the steep microsurgical learning curve and providing a self-paced, ML-advised tutor for continuous training without the need for motion sensors around the operating table [39]. Similarly, Smith et al. reported a motion-tracking ML algorithm for surgical instrument monitoring during cataract surgery [40].
Finally, aiming to standardize surgical procedures, enhance training and lay the groundwork for future robot-assisted surgery, several groups are investigating whether DNN models can dissect surgical workflows into reproducible phases according to environmental exposure, segmentation of the anatomical scenario and instrument usage [41][42][43].

Strengths and Limitations
The DNN model hereby presented and validated on a cohort of more than one hundred patients is a simple but efficient tool able to help physicians diagnose atypical intracranial tumours with minimal additional human effort. Despite not yet being used in real-time scenarios, it is a promising and robust classification model and a candidate for further investigation in clinical trials. Nevertheless, several limitations restrict the generalizability of our results: the outcome accuracy was gauged in "offline" settings on a retrospective pool of image data, so its usefulness in actual clinical practice has been inferred but not demonstrated. In fact, while neuroradiologists with access to other relevant information scored as high as the DNN model in the majority of classes (and even higher on BMs), the interaction between the DNN response and the human decision-making process has not been tested and evaluated. Further prospective trials are required to clarify the impact of artificial intelligence-based decision-making tools on human judgement and performance in clinical practice.

Informed Consent Statement: Informed consent was obtained from all individual participants included in the study. Written informed consent has been obtained from the patient(s) to publish this paper.

Conclusions
Data Availability Statement: All authors confirm the appropriateness of all datasets and software used to support the conclusions. The dataset that supports the findings of this study is available from the corresponding author, L.T., upon request. The source code employed to develop the presented deep learning model is available from the corresponding author, L.T., upon request.

Conflicts of Interest:
The authors have no relevant financial or non-financial interests to disclose.