Accuracy of Machine Learning Algorithms for the Classification of Molecular Features of Gliomas on MRI: A Systematic Literature Review and Meta-Analysis

Simple Summary
Glioma prognosis and treatment are based on histopathological characteristics and molecular profile. Following the World Health Organization (WHO) guidelines (2016), the most important molecular diagnostic markers include IDH1/2-genotype and 1p/19q codeletion status, although more recent publications also include ATRX genotype and TERT- and MGMT promoter methylation. Machine learning algorithms (MLAs) have been reported to determine these molecular characteristics non-invasively by using magnetic resonance imaging (MRI) data. The aim of this review and meta-analysis was to define the diagnostic accuracy of MLAs with regard to these different molecular markers. We found high accuracies of MLAs in predicting each individual molecular marker, with IDH1/2-genotype being the most investigated and the most accurately predicted. Radiogenomics could therefore be a promising tool for discriminating genetically determined gliomas in a non-invasive fashion. Although encouraging results are presented here, large-scale, prospective trials with external validation groups are warranted.

Abstract
Treatment planning and prognosis in glioma treatment are based on the classification into low- and high-grade oligodendroglioma or astrocytoma, which is mainly based on molecular characteristics (IDH1/2- and 1p/19q codeletion status). It would be of great value if this classification could be made reliably before surgery, without biopsy. Machine learning algorithms (MLAs) could play a role in achieving this by enabling glioma characterization on magnetic resonance imaging (MRI) data without invasive tissue sampling. The aim of this study is to provide a performance evaluation and meta-analysis of various MLAs for glioma characterization. Systematic literature search and meta-analysis were performed on the aggregated data, after which subgroup analyses for several target conditions were conducted. This study is registered with PROSPERO, CRD42020191033. We identified 724 studies; 60 and 17 studies were eligible to be included in the systematic review and meta-analysis, respectively. Meta-analysis showed excellent accuracy for all subgroups, with the classification of 1p/19q codeletion status scoring significantly poorer than other subgroups (AUC: 0.748; heterogeneity p = 0.132). There was considerable heterogeneity among some of the included studies. Although promising results were found with regard to the ability of MLA-tools to be used for the non-invasive classification of gliomas, large-scale, prospective trials with external validation are warranted in the future.


Introduction
The most common primary brain tumor, glioma, is a rare cancer, but it is invariably fatal despite surgery, chemotherapy, and radiotherapy. While primary central nervous system tumors account for only 2% of primary tumors, they cause 7% of the years of life lost from cancer before age 70 [1][2][3]. Current glioma classification is based on the 2016 World Health Organization (WHO) guidelines, which differentiate subtypes of gliomas based on the presence or absence of isocitrate dehydrogenase (IDH) mutation and 1p/19q codeletion status. In addition to the mutation status, cytologic features and degrees of malignancy after hematoxylin and eosin (H&E) staining are also evaluated (Figure 1). Over the years, various other molecular biomarkers have been reported in the scientific literature, which led the European Association of Neuro-Oncology (EANO) to consider it necessary to update its guideline for the management of adult patients with gliomas [4]. Improved differentiation between the different subtypes of oligodendroglial tumors and astrocytic tumors based on neuroimaging would be beneficial, as this would facilitate treatment planning, such as the extent of the resection margins and the radiotherapy field [5]. Molecular characteristics of glioma have been shown to represent hallmark features that help clinicians to accurately define the nature of the neoplasm. For example, primary glioblastomas are characterized by a distinct pattern of genetic aberrations when compared with secondary glioblastomas, which develop by degeneration of pre-existing lower-grade gliomas [6]. Moreover, molecular characteristics are known to impact the effectiveness of certain treatment options and can therefore help to identify the most suitable treatment strategy for each patient individually [7,8]. Finally, the different subtypes of glioma are known to have different survival rates [9,10].
With regard to prognosis, patients suffering from a grade II glioma with an oligodendroglial origin have a 5-year survival rate of 81%, whereas those suffering from a grade II astrocytic glioma have a 5-year survival rate of 40%. When classified as WHO grade III, oligodendroglial tumors have better 5-year survival rates than astrocytic tumors (43% vs. 20%, respectively). Patients suffering from glioblastoma (grade IV) have the poorest outcomes, with a 5-year survival rate of 5.5% [11]. In terms of treatment, preoperatively distinguishing oligodendroglial tumors from astrocytic tumors would facilitate planning of the extent of the resection and the radiotherapy field [5]. Unfortunately, no visual features have yet been proven accurate enough to circumvent histopathological assessment after neurosurgical intervention. Application of machine learning algorithms (MLAs), however, could be helpful in the non-invasive characterization of gliomas [12].
As previously predicted, MLAs are increasingly becoming a critical component of advanced software systems in radiology [13,14]. MLAs concern medical imaging analysis carried out by (automatic) feature selection, followed by automatic classification. These processes can detect complex patterns in images that are elusive to the eyes of neuroradiologists and can make predictions that surpass human-level performance. In general, input data for MLAs consist of the imaging data themselves (e.g., different MRI sequences) and/or the segmentation of the regions of interest. Output data, on the other hand, are the desired parameters that should be extracted from the imaging data [13][14][15]. In general, the dataset is divided into two different sets: the training set and the test set. The training set is used to train the MLA, meaning that the MLA attempts to elucidate an often complex relationship between input data and output data. The test set is then used to test the actual performance of the MLA on new data, that is, data on which the network has not been trained. The term "test set" is often used interchangeably with "validation set". Nevertheless, only a small number of MLAs are actually validated on a completely different, external dataset, which significantly hampers the integration of MLAs in daily practice [15].
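The distinction between training set, internal test set, and external validation set described above can be sketched in a few lines of Python. This is a minimal illustration only: the cohort, the institution names `center_A` and `center_B`, and the helper `split_train_test` are hypothetical and do not come from any of the reviewed studies.

```python
import random

def split_train_test(cases, test_fraction=0.2, seed=0):
    """Shuffle cases reproducibly and hold out a fraction as the internal test set."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled cases are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical cohort of (patient_id, institution) pairs.
cohort = [(f"P{i:03d}", "center_A") for i in range(80)]
cohort += [(f"P{i:03d}", "center_B") for i in range(80, 100)]

# Development data come from one institution only.
development = [case for case in cohort if case[1] == "center_A"]
# External validation data come from a different institution and are never used in training.
external = [case for case in cohort if case[1] == "center_B"]

train, internal_test = split_train_test(development, test_fraction=0.25, seed=42)
print(len(train), len(internal_test), len(external))
```

The key property is that no patient appears in more than one set, and that the external set differs from the development data in origin rather than merely being a random subset of it.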
With regard to the use of MLAs in neuro-oncology imaging, various reports on the use of MLAs, using a broad range of extracted features on magnetic resonance imaging (MRI), showed promising results with regard to the prediction of molecular markers and genetic alterations (e.g., IDH genotype, 1p/19q codeletion status, P53 mutations, MGMT promoter methylation, TERT promoter mutation, BRAF status, EGFR receptor mutations) [3,12,16-19]. However, one of the limitations of this type of research is the relatively limited amount of data in each study, which could possibly be overcome by a systematic review and meta-analysis of the aggregated study results [20]. The purpose of this study was to provide such an overview and perform a meta-analysis of the accuracy of MLAs in predicting the genotype of gliomas.

Guidelines and Registration
A systematic review and meta-analysis were conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [21]. Prior to the initiation of the review, the study protocol was registered in the international open-access Prospective Register of Systematic Reviews (PROSPERO) under the number CRD42020191033.

Search Strategy, Inclusion Criteria, and Exclusion Criteria
Papers describing the use of MLAs for the binary classification of molecular characteristics of gliomas were reviewed. Databases searched for literature included Medline (accessed through PubMed), EMBASE, and the Cochrane Library. Searches were conducted from 1 April 2020 to 24 January 2021. Search strings are made available in the Supplementary Methods S1. Papers were included when they discussed the use of classification MLA methodologies on MR images in glioma patients. In addition, papers had to report results as mean accuracy and/or mean area under the receiver operating characteristic curve (AUC). Papers were excluded from this review when they discussed findings in animal-based studies or in non-human samples. In addition, MLA models needed to be at least internally validated. Letters, preprints, scientific reports, and reviews were included. After the removal of duplicates, the remaining papers were systematically screened on title and abstract by two researchers (E.v.K. and D.H.) independently. Non-consensus papers were identified and discussed by the same two researchers to resolve disagreements and to reach consensus. Formal quality assessment tools are still lacking for this type of research [19], although a version of the TRIPOD statement tailored to machine learning methods has been announced [22].
Standardized tables were used to acquire the information of interest from the included articles. Data extracted from each study were (a) first author and year of publication, (b) size of the training set, (c) mean age of participants in the training set, (d) sex of participants in the training set, (e) size of the validation set, (f) whether there was an external validation, (g) study design, (h) architecture of the MLA algorithm(s), (i) target condition, (j) performance of the algorithm(s). Performance of the classification tools was expressed in accuracy, AUC, sensitivity, and specificity for both the training and the validation set. For studies performing external validation, externally validated data were displayed. Extracted data were cross-checked afterward, and discrepancies were resolved.
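For readers less familiar with the extracted performance measures, the following sketch shows how accuracy, sensitivity, and specificity follow from a binary confusion matrix, and how the AUC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The counts and scores are invented for illustration and are not taken from any included study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

def auc_from_scores(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 50 mutant and 50 wild-type cases.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
# Hypothetical predicted probabilities for four cases (1 = mutant, 0 = wild-type).
auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(metrics, auc)
```

Unlike accuracy, sensitivity, and specificity, the AUC does not depend on a single decision threshold, which is one reason it is a convenient common metric for pooling.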

Statistical Analysis
Meta-analysis was conducted on the papers that reported the AUC ± standard deviation (SD), using a random-effects model to estimate the performance of the included MLA methodologies. For inclusion in quantitative analysis, studies must have reported a standard deviation, 95% confidence interval (CI), or standard error, along with the AUC-value. For the meta-analysis, the standard deviation was derived from the standard error or 95%-CI for studies not reporting the standard deviation [23]. If not provided, corresponding authors were contacted with the request to provide the necessary data to be included in the meta-analysis. Results of all appropriate studies were combined to meta-analyze the aggregated data. Then, meta-analyses were conducted on different subgroups of target conditions in order to estimate the accuracy of the algorithm for each condition separately. To be included in subgroup analysis, an additional criterion for the included studies was to describe a specific target condition (e.g., IDH and 1p/19q). Meta-analysis was performed with the use of IBM SPSS Statistics (IBM Corp. Released 2017. IBM SPSS Statistics for Windows, Version 25.0. Armonk, NY: IBM Corp.) and OpenMeta[Analyst] software (MetaAnalyst, Brown University EPBC [24]), which is the visual front-end for the R package (version 12.11.14) [25]. Results were displayed in forest plots. AUC-values of subgroups were compared by looking at the 95% confidence intervals and whether there was any overlap. The Higgins I²-test was used to test for heterogeneity between included studies, with I² > 75% deemed considerable heterogeneity [23]. The Egger regression analysis was carried out to test for publication bias [26].
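The pooling steps described above can be illustrated with a short sketch. It recovers a standard error from a reported 95% confidence interval, pools study-level AUCs under a random-effects model, and computes Higgins I². Two points are assumptions rather than details from the text: the between-study variance is estimated with the DerSimonian-Laird method (the text does not name the estimator used), and the study values in the example are hypothetical.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% normal quantile

def se_from_ci(lower, upper):
    """Standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * Z95)

def random_effects(estimates, ses):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    Returns (pooled estimate, 95% CI, I^2 in percent, tau^2).
    """
    w = [1 / se ** 2 for se in ses]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1 / (se ** 2 + tau2) for se in ses]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    ci = (pooled - Z95 * se_pooled, pooled + Z95 * se_pooled)
    return pooled, ci, i2, tau2

# Hypothetical study-level AUCs with reported 95% CIs (not from the included studies).
aucs = [0.92, 0.85, 0.78]
cis = [(0.88, 0.96), (0.79, 0.91), (0.70, 0.86)]
ses = [se_from_ci(lo, hi) for lo, hi in cis]
pooled, ci, i2, tau2 = random_effects(aucs, ses)
print(round(pooled, 3), [round(v, 3) for v in ci], round(i2, 1))
```

When the studies agree closely, the estimated between-study variance collapses to zero and the result coincides with a fixed-effect analysis; when they disagree, the random-effects weights become more equal and the pooled interval widens.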

Results
A total of 1094 publications were initially retrieved through literature searches, and 724 remained after the removal of duplicates. After title and abstract screening, 215 full-text articles were assessed for eligibility. After a full-text assessment, 60 and 17 studies were included for the systematic review and meta-analysis, respectively (Figure 2). A total of 155 studies were excluded based on the full-text assessment for the following reasons: 66 studies described the segmentation of the tumor area instead of the classification of glioma; 23 studies did not report AUC as a performance metric; 14 studies discussed the classification of texture features instead of molecular characteristics of glioma; 14 studies had an incomplete description of the published data and/or methods; 14 studies did not describe an MLA model, but instead a number of combined features; 13 studies reported a target condition that was too specific (e.g., H3K27M and Ki-67); 5 studies used imaging techniques other than MR imaging; 3 studies had no internal validation group; 3 studies included other brain tumors besides glioma.
Three studies [36,56,62] were included that described the classification of the TERT promoter mutation status in glioma. All studies used retrospectively collected data, and one study [36] was externally validated. Although still under investigation, it has been suggested that TERT promoter mutations characterize gliomas that require aggressive treatment [8]. All studies reported the validated AUC (range: 0.82-0.89). Additionally, the sensitivity and specificity ranged from 71% to 77% and 86% to 91%, respectively (n = 3).
For the subgroup meta-analysis of the classification studies with a focus on IDH mutation status, eight MLAs, originating from seven studies [19,37,49,56,61,66,74], were included. Results show an overall AUC of 0.909 (95%-CI: 0.867-0.951), as seen in Figure 3. Moreover, heterogeneity between groups, measured with Higgins I², was estimated as 90.402% (p < 0.001). The forest plot shows that the performance of the MLAs in classifying molecular characteristics of glioma is centered around an AUC of 0.858, with a 95%-CI ranging from 0.812 to 0.904.
Three studies [32,56,64] were included in the subgroup meta-analysis of the 1p/19q codeletion status. Results of this subgroup analysis are displayed in Figure 4. The overall AUC is 0.748 (95%-CI: 0.699-0.797). Heterogeneity between groups was considered moderate (Higgins I² = 50.655%, p = 0.132). Subgroup meta-analysis of MGMT promoter methylation status included three studies [35,56,65]; see Figure 5 for an overview of the results. The overall AUC of these MLA models was estimated as 0.866 (95%-CI: 0.812-0.921). Heterogeneity between the included studies was considered very low (Higgins I² = 0%, p = 0.453). Three studies [36,56,62] were included in the subgroup analysis of the TERT promoter mutation status. Figure 6 displays the results of the meta-analysis, with an estimated overall AUC of 0.842 (95%-CI: 0.783-0.901) and low heterogeneity (Higgins I² = 0%, p = 0.582). Classification of the 1p/19q codeletion status had a significantly poorer AUC when compared to other subgroup classifications, except for the TERT mutation status classification. No significant differences in performance between the other three subgroups (i.e., IDH, MGMT, TERT) were observed due to overlap of the 95% confidence intervals.
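The subgroup comparison above rests on whether the 95% confidence intervals overlap. The following sketch applies that rule to the four pooled intervals reported in this section. Note that interval overlap is a conservative heuristic rather than a formal hypothesis test; the helper name `intervals_overlap` is introduced here for illustration only.

```python
from itertools import combinations

# Pooled 95% confidence intervals of the AUC, as reported in the text above.
subgroups = {
    "IDH": (0.867, 0.951),
    "1p/19q": (0.699, 0.797),
    "MGMT": (0.812, 0.921),
    "TERT": (0.783, 0.901),
}

def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Pairs whose intervals do NOT overlap are the ones read as significantly different.
non_overlapping = [
    (x, y)
    for x, y in combinations(subgroups, 2)
    if not intervals_overlap(subgroups[x], subgroups[y])
]
print(non_overlapping)
```

Under this rule, only the pairs involving 1p/19q against IDH and against MGMT are separated, while the 1p/19q and TERT intervals overlap narrowly (0.783 to 0.797), which matches the exception noted in the text.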

Testing for Publication Bias
Egger's regression test showed no significant publication bias with regard to MLAs to predict the molecular status of gliomas (p = 0.235).
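As a rough illustration of what Egger's regression examines, the sketch below regresses the standardized effect (estimate divided by its standard error) on precision (the reciprocal of the standard error) using ordinary least squares; an intercept far from zero suggests funnel-plot asymmetry. This is a simplified sketch with hypothetical inputs, not the procedure as implemented in the software used for this review, and it reports the t-statistic only: converting it to a p-value requires the t-distribution with n - 2 degrees of freedom, which is omitted here.

```python
import math

def egger_regression(estimates, ses):
    """OLS of (estimate / SE) on (1 / SE).

    Returns (intercept, slope, SE of intercept, t-statistic of intercept).
    Requires at least three studies.
    """
    x = [1 / se for se in ses]  # precision
    y = [est / se for est, se in zip(estimates, ses)]  # standardized effect
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in residuals) / (n - 2)  # residual variance
    se_intercept = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, se_intercept, t_stat

# Hypothetical AUCs and standard errors (not from the included studies).
aucs = [0.91, 0.86, 0.88, 0.80, 0.84]
ses = [0.015, 0.030, 0.022, 0.045, 0.035]
intercept, slope, se_intercept, t_stat = egger_regression(aucs, ses)
print(round(intercept, 3), round(slope, 3), round(t_stat, 3))
```

With only a handful of studies per subgroup, such a test has little power, so a non-significant result does not rule out publication bias.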

Discussion
In this review, studies that describe the classification of gliomas with the use of MLAs were reviewed and meta-analyzed. The overall performance of the classification tools, as reported in AUC-values, proved to be excellent. Subgroup analysis showed that the classification of 1p/19q codeletion status was significantly poorer than the classification performance of other molecular markers (i.e., IDH and MGMT). The observed heterogeneity between the included studies in the IDH-subgroup was considerably high.
The binary classification of various molecular characteristics of glioma with the use of artificial intelligence showed promising results with regard to future implementation in clinical practice. This implementation could have a significant impact on the care of glioma patients, as it could help to stratify patients for treatment options prior to undergoing surgery. However, clinically relevant studies need to be undertaken to increase the impact of these techniques in daily practice. For example, predicting 1p/19q codeletion status is more relevant in a subset of low-grade gliomas, as it enables the non-invasive distinction of IDH-mut astrocytoma (1p/19q intact) from oligodendroglioma (1p/19q codeleted) [62]. This is clinically relevant as the median survival of patients with these glioma subtypes is significantly different and can be impacted by the extent of the neurosurgical resection [86,87]. In addition, to further verify the performance of MLA methodologies, larger-scale multi-center studies using prospective data are required. Moreover, despite the growing knowledge and use of MLA methodologies, integration in widespread clinical practice still faces some challenges [12]. One major challenge for this implementation concerns the generalizability of these systems, as they are mostly trained on small datasets lacking external validation [88]. Considering external validation as an additional inclusion criterion for a sub-analysis, we found that no meta-analysis could be performed. The twelve papers included in this review which validated their results externally investigated IDH mutation status (n = 2), 1p/19q codeletion status (n = 1), MGMT promoter methylation status (n = 2), PTEN gene mutation (n = 1), ATRX gene mutation (n = 1), TERT promoter mutation (n = 1), and various predictions with regard to WHO grading (n = 4), indicating that no pooled data can be acquired from these individual studies.
Although the computer-aided classification of glioma holds great potential, computer-obtained diagnosis is not likely to replace histopathologic diagnosis in the near future.

Implementation of Computer-Aided Approaches in Future Medicine
The diagnosis of different diseases by the use of MLAs is believed to hold great potential in modern medicine. The number of retrieved papers on this narrow topic is relatively limited, especially when compared to other reviews with a broader scope with regard to the use of artificial intelligence in medical imaging analysis [22]. Nevertheless, conclusions and major limitations seem to be similar across fields. We can cautiously state that the accuracy of MLAs in the non-invasive classification of glioma holds great potential and is equal to or better than the predictions of healthcare professionals. On the other hand, the lack of external validation of the obtained results was recognized as the major limitation of the current scientific literature. Additionally, poor reporting is known to be prevalent in MLA studies, which limits reliable interpretation of the reported diagnostic accuracy and thereby hampers clinical implementation. Improving reporting and publication will enable greater confidence in the results of future evaluations of these promising technologies in medicine. Once such improved confidence is achieved, prospective evaluation should be carried out in the context of an intended clinical pathway. With regard to the context discussed in this paper, MLAs could be used on the preoperative imaging data, after which the predicted outcomes can be compared to the histopathological assessment after biopsy. Such implementation will help to elucidate whether important unknown covariates were present in the retrospective studies reviewed here. Thereafter, a randomized comparison could help to reveal and quantify possible clinical implications of implementing these MLAs in daily practice.
Furthermore, as recently suggested by Bhandari et al., a greater effort is needed to start translating these findings into an interpretable format for clinical radiology [89]. In addition, as several studies focused on single molecular biomarkers in gliomas, it must be underlined that several molecular alterations in astrocytic and oligodendroglial gliomas can occur in different combinations [79]. Consequently, a growing number of (and a combination of) different molecular tests are used to provide clinically relevant tissue-based biomarkers. Furthermore, the performance of MLAs in grading glioma according to the WHO grading score remains restricted by the interobserver variability of the neuropathological examination, as reviewed by Van den Bent (2010) [90]. This clinically significant interobserver variation in the histological grading of glioma limits the diagnostic performance of other diagnostic tests (e.g., MLAs), as the neuropathological assessment was considered to be the ground truth.

Clinical Relevance of Computer-Aided Diagnosis
Automated diagnosis from medical imaging through artificial intelligence could help to overcome the mismatch between the increasing number of diagnostic images and the capacity of available specialists [91]. More than 100 MLAs have now been CE-marked, 57 of which can be used for neuroimaging. Only four of these MLAs have been tailored for use in neuro-oncology practice [91] (see https://grand-challenge.org/aiforradiology/; accessed on 5 March 2021). Only one of these software packages claims to aid in tumor differentiation (The Brain Tumours Application; Hanalytics (BioMind); https://biomind.ai/; accessed on 5 March 2021). This application focuses on the differentiation of 22 types of intracranial tumors on MRI scans, including the differentiation of astrocytoma, oligodendroglioma, and glioblastoma. Consequently, no software is commercially available to distinguish different molecular subtypes of glioma. We therefore conclude that the use of MLA models in daily radiological practice to non-invasively predict glioma subtype remains an important topic of future research in order to improve accuracy and commence external validation [91].

Strengths and Limitations
Meta-analysis of the aggregated MLA models showed high heterogeneity between included study groups. This heterogeneity could be expected, as multiple subgroups of target conditions are included in this analysis. Subgroup meta-analyses show significantly lower heterogeneity among included groups. However, for the IDH-subgroup, estimated heterogeneity is still remarkably high (I² > 80%). Possible explanations for this heterogeneity could be the inclusion of multiple technically different MLA methodologies, multiple included MRI protocols with different sequences (e.g., T1-weighted, T2-weighted, diffusion-weighted), and the fact that there were no specific criteria set for the target population of glioma patients. The latter indicates that there is a good chance of variety between the included groups of patients. Although not supported by the results of the Egger's regression test, the presence of publication bias is not unlikely, as there is little interest in classification tools with poor performance. The analyzed MLA methodologies showed excellent accuracy in the classification of multiple molecular characteristics of glioma. However, some deficiencies in the methods of this study should be considered. The quality of included articles was not formally assessed because no sufficient assessment tool is currently available for prediction models using MLA-techniques. The previously announced MLA variant of the TRIPOD statement could possibly be the solution to this problem for future research [22]. Moreover, external validation of the MLA methodologies was conducted for only 12 of the 60 studies. As internal validation commonly overestimates the performance, external validation of the described system is highly preferred. Lastly, a large number of the studies found in the initial search did not report the validated AUC-value. Therefore, several studies needed to be excluded from this review.
Moreover, out of 60 studies included for qualitative synthesis, 43 studies did not report the 95% confidence interval, standard error, and/or standard deviation, which restricted the number of studies eligible for meta-analysis. For this reason, we recommend a standardized way of reporting MLA findings. In addition, as MLAs are statistical model-fitting methods, the reliability and performance of each study are at least partially dependent on the study sample size. Therefore, a large dataset next to the available BraTS dataset would be indispensable for this field of research.

Conclusions
This systematic review and meta-analysis shows good accuracy for various MLA methodologies for the classification of molecular characteristics of gliomas, which could be beneficial for treatment planning. Remarkably, various studies did not perform external validation, which significantly limits the results of these studies. Quality guidelines should be used when publishing studies on MLAs, including out-of-sample external validation and standardized reporting of obtained results.