Machine Learning in the Classification of Pediatric Posterior Fossa Tumors: A Systematic Review

Simple Summary: Diagnosis of posterior fossa tumors is challenging, yet proper classification is imperative given that treatment decisions diverge based on tumor type. The aim of this systematic review is to summarize the current state of machine learning methods developed as diagnostic tools for these pediatric brain tumors. We found that, while individual algorithms were quite efficacious, the field is limited by its heterogeneity in methods, outcome reporting, and study populations. We identify common limitations in the study and development of these algorithms and make recommendations as to how they can be overcome. If incorporated into algorithm design, the practical guidelines outlined in this review could help to bridge the gap between theoretical algorithm diagnostic testing and practical clinical application for a wide variety of pathologies.

Abstract: Background: Posterior fossa tumors (PFTs) are a morbid group of central nervous system tumors that most often present in childhood. While early diagnosis is critical to drive appropriate treatment, definitive diagnosis is currently only achievable through invasive tissue collection and histopathological analyses. Machine learning has been investigated as an alternative means of diagnosis. In this systematic review and meta-analysis, we evaluated the primary literature to identify all machine learning algorithms developed to classify and diagnose pediatric PFTs using imaging or molecular data. Methods: Of the 433 primary papers identified in PubMed, EMBASE, and Web of Science, 25 ultimately met the inclusion criteria. The included papers were extracted for algorithm architecture, study parameters, performance, strengths, and limitations. Results: The algorithms exhibited variable performance based on sample size, classifier(s) used, and individual tumor types being investigated. Ependymoma, medulloblastoma, and pilocytic astrocytoma were the most studied tumors, with algorithm accuracies ranging from 37.5% to 94.5%. A minority of studies compared the developed algorithm to a trained neuroradiologist, with three imaging-based algorithms yielding superior performance. Common algorithm and study limitations included small sample sizes, uneven representation of individual tumor types, inconsistent performance reporting, and a lack of application in the clinical environment. Conclusions: Artificial intelligence has the potential to improve the speed and accuracy of diagnosis in this field if the right algorithm is applied to the right scenario. Work is needed to standardize outcome reporting and facilitate additional trials to allow for clinical uptake.


Introduction
Brain tumors are the second leading cause of death in children under 15 with an estimated incidence of 2-3.5 per 100,000 [1,2]. Posterior fossa tumors (PFTs) comprise 50-74% of childhood brain tumors, with the majority being juvenile pilocytic astrocytomas, medulloblastomas, ependymomas, and brainstem gliomas [3,4]. Central nervous system tumors in the pediatric population frequently present with nonspecific symptoms, which can lead to delays in diagnosis and treatment. One study found that the average time to diagnosis in a cohort of pediatric brain tumor patients was 7.7 months after symptom onset [5]. Given the rapid progression of some pediatric brain tumors, delays in diagnosis are associated with significant morbidity and mortality. Since treatment varies based on the type and grade of PFT, it is imperative to obtain an early diagnosis in this highly morbid group of malignancies. Histopathological diagnosis remains the standard of care for the diagnosis of PFTs. While accurate, this method is time consuming and requires a tissue specimen as well as access to a trained neuropathologist. While conventional magnetic resonance imaging (MRI) can be used to evaluate tumor location and impact on surrounding structures, it is of limited diagnostic value. Radiological differentiation between different PFTs is difficult and can be further complicated by tumor mimics such as demyelinating disorders and Alexander disease [6].
Some progress has been made to improve the diagnostic accuracy of imaging with the addition of advanced MR sequences such as diffusion-weighted imaging (DWI). Using apparent diffusion coefficient (ADC) ratios, radiologists in one study were able to discriminate pilocytic astrocytomas from ependymomas with a sensitivity of 83% and a specificity of 78% [7]. The discovery that individual radiomic and molecular features correlated with distinct PFTs led to the application of artificial intelligence for the diagnosis and subclassification of these tumors. Prior work has shown that artificial intelligence is becoming an increasingly viable tool with the potential to improve diagnostic speed and accuracy [8,9]. Machine learning has already been heavily implemented in the diagnosis of brain tumors in both children and adults, with previous studies reporting algorithms that can differentiate gliomas, meningiomas, and pituitary tumors based on extracted imaging features with accuracies as high as 99% [10-12]. Additional work has shown the possibility of using these methods not only to differentiate between tumor types but also to subclassify tumors by grade, stage, and even molecular features [12-14]. Similar methods are now being explored to diagnose and classify PFTs. In this systematic review and meta-analysis, we aim to identify and critique all the primary literature that applies machine learning to the diagnosis and classification of pediatric PFTs. We analyze the algorithm architecture and efficacy as well as study parameters, strengths, and limitations to assess the clinical readiness of such technology, provide recommendations of best practices, and highlight areas for improvement. This work serves as a case study on how machine learning classification algorithms can be applied to clinical diagnosis, with recommendations that can be applied to other pathologies.

Materials and Methods
This systematic review of the literature was completed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [15]. Standardized electronic searches were conducted in PubMed, EMBASE, and Web of Science to identify relevant articles. Searches were conducted using conjugated "AND" and "OR" statements with keywords related to machine learning, artificial intelligence, and pediatric PFTs (Supplementary Materials). Searches included all articles in the English language from database inception to 31 July 2022.

Inclusion and Exclusion Criteria
All observational studies, clinical trials, case reports, and technical papers assessing the use of machine learning to diagnose or classify PFTs based on molecular or radiomic features were included. No limit was placed on sample size or timeframe. Review articles, abstracts, conference abstracts, and primary papers that did not study the application of a machine learning algorithm (MLA) to the diagnosis or classification of a pediatric PFT were excluded. Papers that subclassified pediatric PFTs by other criteria, such as prognosis or response to treatment, were also excluded.
Studies identified by the literature search were screened in two rounds, with appropriateness determined by consensus of the authors. First, title and abstract screening was conducted, and papers that met the exclusion criteria were removed; the same process was then repeated with a full-text review. All disagreements were resolved by author consensus.

Data Extraction
Two authors independently extracted data from the full texts of included articles into a standardized extraction table. Disagreements were decided by a two-author consensus. Data collected from each study covered study parameters including title and author, population size by tumor type, tumor type(s) being studied, study location(s), study timeframe, and ground truth used; algorithm parameters including type of input data, training set size, validation set size, test set size, method of image segmentation (manual vs. automatic), normalization used, presence/absence of texture analysis, deep learning model architecture, presence/absence of feature selection, and number of features extracted in the final algorithm; algorithm performance statistics including sensitivity, specificity, accuracy, area under the curve (AUC), F1-score, Dice coefficient, positive predictive value, and negative predictive value; comparisons and analyses performed, including comparison of the algorithm to a neuroradiologist, neuropathologist, or other clinical standard of care as well as the outcome of the comparison; and both algorithm and study limitations.

Gold Standard Comparison
For each paper that included a comparison of an MLA to a gold standard, the minimum and maximum AUCs or accuracies were collected for each method. The following calculations were conducted to compare the best- and worst-case efficacy of each diagnostic method: the difference between the maximum accuracy/AUC for the MLA and the minimum accuracy/AUC for the gold standard was computed. The same calculation was repeated with the maximum accuracy/AUC for the gold standard and the minimum accuracy/AUC for the MLA.
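In concrete terms, each comparison reduces to two differences computed from the reported metric ranges. The minimal Python sketch below illustrates the calculation; the input values are hypothetical and not taken from any reviewed study.

```python
# Minimal sketch of the best-/worst-case comparison described above.
# The input ranges are hypothetical, not taken from any reviewed study.
def best_worst_case_deltas(mla_metrics, gold_metrics):
    """Given reported accuracy/AUC ranges for an MLA and a gold standard,
    return the (best-case, worst-case) differences in favor of the MLA."""
    best_case = max(mla_metrics) - min(gold_metrics)   # MLA at its best vs. gold standard at its worst
    worst_case = min(mla_metrics) - max(gold_metrics)  # MLA at its worst vs. gold standard at its best
    return best_case, worst_case

# Hypothetical example: MLA reported 0.78-0.94, radiologist 0.70-0.88.
print(best_worst_case_deltas([0.78, 0.94], [0.70, 0.88]))  # approx (0.24, -0.10)
```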

Algorithm Study Parameters and Design
Eight papers did not fully define the training or validation set employed [22,23,26,27,32,36,39,40]. Of those studies that did, most had a significantly larger training set than validation set. Bidiwala et al. [17] and Fetit et al. [21] both utilized cross-validation given their small sample sizes.

Table 2. Summary of common machine learning classifiers used in the classification of posterior fossa tumors.

Classifier Algorithm | Description
K-nearest neighbor | Determines the probability a datapoint will fall into a group based on its distance from the group's members
Support vector machine | Assigns datapoints to one of two or more categories based on their locations in a space where the distance between the categories is maximized
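For readers unfamiliar with these methods, the short Python sketch below shows how the two classifiers in Table 2 are typically instantiated with scikit-learn. The data are synthetic stand-ins for extracted features; none of the sizes or parameters reflect any reviewed study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for extracted features across three tumor classes.
X, y = make_classification(n_samples=150, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-nearest neighbor: classifies by distance to the k closest training points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Support vector machine: finds a maximum-margin separating boundary.
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

print("kNN accuracy:", knn.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))
```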

High-Yield Features
Individual features important for the discrimination of PFTs were dependent on the dataset of origin. For generic T1- and T2-weighted imaging, extracted texture features were highly discriminative [21,24,25,27,33,34]. The most discriminative features from DWI were generated from ADC maps; these included ADC mean, ADC skewness, ADC energy, ADC entropy, ADC low grey level zone emphasis, and others [19,20,22,23,26,28,29]. For MR spectroscopy, mean spectra and lipid peaks were the main discriminators [18,30]. For methylation array data, individual CpG islands had the highest discriminative value [38]. For classifiers generated from microscopy data, nuclear density, tumor-associated macrophage density, nuclear compactness, and maximum radius were most important for discrimination [39].
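As an illustration of how such first-order ADC features are commonly derived, the sketch below computes them from the ADC values within a segmented region of interest. Exact definitions vary between studies and radiomics packages (e.g., PyRadiomics), so this is one plausible convention rather than the method of any particular paper.

```python
import numpy as np
from scipy import stats

def adc_first_order_features(adc_roi, n_bins=64):
    """adc_roi: 1-D array of ADC values inside the segmented tumor ROI."""
    hist, _ = np.histogram(adc_roi, bins=n_bins)
    p = hist / hist.sum()            # discretized intensity probabilities
    p_nz = p[p > 0]
    return {
        "adc_mean": float(np.mean(adc_roi)),
        "adc_skewness": float(stats.skew(adc_roi)),
        # "Energy" here is histogram uniformity; some packages instead use
        # the sum of squared raw intensities.
        "adc_energy": float(np.sum(p ** 2)),
        "adc_entropy": float(-np.sum(p_nz * np.log2(p_nz))),
    }

# Example on synthetic ADC values (in 10^-6 mm^2/s):
rng = np.random.default_rng(0)
print(adc_first_order_features(rng.normal(1200.0, 150.0, size=5000)))
```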

Comparison to Neuroradiologist
The efficacies of the developed MLAs were compared to those of a trained neuroradiologist in seven cases (Figure 3). Algorithms developed by Bidiwala et al. [17], Davies et al. [18], and Fetit et al. [21] all outperformed the neuroradiologist at both best-case and worst-case reported accuracies/AUCs. Of note, Davies et al. [18] was the only study to compare a radiologist to a radiologist augmented by an MLA. Results were equivocal for Arle et al. [16], Quon et al. [31], and Zhou et al. [37]: at the maximum reported accuracy/AUC, these algorithms outperformed the standard of care, but at the lower end of reported functioning, they were inferior to the standard of care at its optimal performance. Payabvash et al. [28] could not be compared to a neuroradiologist because overall accuracy/AUC was not provided for each MLA being evaluated. Leslie et al. [40] reported additional diagnostic accuracies of 85%, 96%, 61%, and 75% for astrocytomas, gliomas, oligodendrogliomas, and gangliogliomas, respectively.


Observed Limitations
The limitations of the studied MLAs were divided into methodologic limitations and algorithmic limitations (Table 5). Methodologic limitations relate to study design, the generation of data, and the training of the algorithm. Most major limitations observed were methodologic. Nineteen algorithms (76%) used retrospectively collected data, and 18 algorithms (72%) were trained or validated on small samples of fewer than 50 patients, many with incomplete radiographic or molecular datasets.
Algorithms were heterogeneously crafted and studied. For example, where Arle et al. [16] extracted 20 features to classify 33 tumors using a single NN, Zhang et al. [35] extracted over 1800 features from 527 patients using an ensemble of six different classifiers. Given the vast number of available features to be extracted from multiple data streams, classifier combinations to be applied, and methods of performance analysis to be employed, success in this space depended on the algorithm creators' ability to select the proper data and methods for the desired goal.

Algorithm Selection
The machine learning approaches employed a variety of classification algorithms to discriminate between PFTs. Surprisingly, while there was some variation, all of the classifiers yielded fairly high accuracies in the individual diagnosis of ependymoma, medulloblastoma, and pilocytic astrocytoma. By contrast, significant differences were observed in the overall accuracy of the MLAs. Variation in the accuracies reported by studies employing the same MLAs may explain some of this discrepancy. Furthermore, algorithm accuracies were only reported on a per-tumor basis in a minority of studies, and studies reporting positive results may be more likely to publish these tumor-specific performance metrics.
LDA, kNN, and RF algorithms had the lowest accuracies with significant variation in the reported results. LDA, while simple to implement, is often critiqued as not being expressive enough to capture complex differences between groups [41]. kNN methods, while commonly used, are highly sensitive to dataset size and quality, which may explain their poor performance on the small, unbalanced PFT datasets used for model training. Additionally, kNN algorithms depend on a knowledgeable operator given the difficulty of choosing a proper k for a given training set [42]. RF models, lauded as a fast ensemble method of classification, are unable to extrapolate datapoints outside the range of the training set and respond poorly to noisy datasets [43]. All three poorly performing algorithms rely on a broad, high-quality training set, which may have been lacking in these cohorts. While these MLAs have their merits, caution should be employed when applying them to small, unbalanced datasets.
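The k-selection difficulty noted above is typically handled by model selection over candidate values of k. A hedged sketch, using scikit-learn on a small, unbalanced synthetic dataset (all sizes and class weights are arbitrary stand-ins for the cohorts discussed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Small, unbalanced synthetic dataset mimicking the cohort sizes discussed.
X, y = make_classification(n_samples=60, n_features=15, n_informative=8,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7, 9]},
    cv=StratifiedKFold(n_splits=5),
    scoring="balanced_accuracy",  # penalizes ignoring the rare class
)
search.fit(X, y)
print("best k:", search.best_params_, "CV score:", round(search.best_score_, 3))
```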
The highest-performing 3-way classifiers utilized PNN and NB algorithms. While computationally demanding, PNNs are some of the most effective MLAs in terms of accuracy and outlier handling [44]. Additionally, PNNs have a history of success in the classification of brain tumors [45] and, compared to other MLAs, are well suited to training on large datasets. NB classifiers are intuitive, scalable, efficient, and robust to outliers. While they assume independence between all features, a higher degree of independence can be ensured through the use of feature selection [46]. Both techniques offer high accuracy despite the presence of outliers, which may explain their applicability in PFT diagnostics.
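As a sketch of the feature selection point, the following pairs a Gaussian naive Bayes classifier with univariate feature selection so that the retained features are less redundant; the dataset dimensions are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Many candidate features, few of them informative (illustrative sizes).
X, y = make_classification(n_samples=120, n_features=200, n_informative=12,
                           n_classes=3, random_state=0)

nb = make_pipeline(
    SelectKBest(mutual_info_classif, k=12),  # drop redundant/noisy features
    GaussianNB(),
)
print("CV accuracy:", cross_val_score(nb, X, y, cv=5).mean())
```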
With moderate classification accuracies, SVMs were the most frequently employed classifier in this cohort. Given that SVM methods perform well on high-dimensional and unstructured data, such as that derived from imaging, an SVM classifier is a good fit for the PFT classification problem [47]. These benefits come with the associated challenges of long training times and the difficulty of choosing a proper kernel function [48]. SVM models are additionally known to underperform when trained on datasets that contain significantly more variables than samples, which may explain the lackluster results in these cohorts [49].
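The dimensionality caveat can be seen directly by evaluating an SVM in a regime where features far outnumber samples; the numbers below are arbitrary illustrations, not cohort data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 40 samples, 500 features: the variables >> specimens regime noted above.
X, y = make_classification(n_samples=40, n_features=500, n_informative=10,
                           random_state=0)
for kernel in ("linear", "rbf"):
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(float(score), 3))
```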

Objective of Machine Learning Application
Machine learning has generated much excitement as a potential driver of cost reduction and improved diagnostic accuracy in clinical practice. Diagnostic interpretation by a radiologist has previously been shown to be highly operator-dependent, a problem that is further magnified in the diagnosis of PFTs, which have many overlapping radiographic features [50]. Multiple studies have shown that machine learning approaches have improved diagnostic efficacy when compared to their human counterparts [51-53]. Imaging-based MLAs applied to glioma diagnosis have shown the potential to improve clinical decision making regarding the diagnosis and management of adult glioma patients [54]. In fact, an artificial intelligence-driven, MRI-based brain tumor diagnostic program has already been integrated into clinical practice with some success [11]. The implementation of a similar platform in the diagnosis of pediatric PFT patients could preclude the need for an invasive biopsy and decrease time to diagnosis. While surgical resection is typically standard of care for these patients, neoadjuvant chemotherapy is sometimes administered [55]. Increased confidence in the diagnosis would allow treatment to be better tailored; for example, obtaining a surgical gross total resection is much more important for improved outcomes in ependymoma than in medulloblastoma. Finally, the application of MLAs in this space would allow a diagnosis to be obtained in resource-poor settings where a trained neuroradiologist, neurosurgeon, and neuropathologist are not always available.

Translation to Clinical Practice
An algorithm is ready for the clinical environment if it can perform with efficacy equivalent to the clinical alternative and has demonstrated reliability when applied prospectively in the clinic; however, comparison to a clinical standard is often difficult. Given that the standard of care is a pathological diagnosis, little clinical benefit is generated from algorithms that can make a comparable diagnosis based on a tissue sample. Instead, the true clinical value of MLAs is derived from improvements made in the diagnostic accuracy of non-invasive data sources such as imaging. Only seven of the algorithms identified in this review made any performance comparison to a neuroradiologist [16-18,21,28,31,37]. Of these, only three definitively outperformed the radiologist. For example, even at its worst-case performance, the algorithm developed by Bidiwala et al. [17] showed a 14% greater accuracy than the highest reported accuracy of a neuroradiologist [18,21]. The remaining four studies were more equivocal, with algorithms that could outperform the radiologist under ideal conditions but underperformed in the diagnosis of certain tumor subtypes or when specific classifiers were applied [16,28,31,37]. Given the heterogeneity of developed algorithms in this space, no generalization can be made regarding algorithm performance as compared to a radiologist. However, it seems that under specific conditions, a minority of the posterior fossa classification algorithms can consistently improve diagnostic accuracy compared to trained neuroradiologists. Unfortunately, no such analysis is possible for the molecular diagnostic algorithms, as none were compared to a clinical alternative. In addition, these methods still require a biopsy, and no study examined other factors that may justify clinical use, such as improved cost or efficiency compared to diagnosis by a neuropathologist.
Regarding the second criterion, a lack of application in the clinical environment is the true barrier to clinical integration of these algorithms. Not a single algorithm from the 25 studies identified in this review was trialed in the clinic. While six algorithms had prospective data collection, they did not apply patient data in real time to yield a diagnosis, as would be expected in a real clinical workflow. Davies et al. [18] took the added step of assessing algorithm performance as an adjunct to a neuroradiologist's decision making, but this still occurred outside of the clinic. A common critique of MLAs is that the results of theoretical research studies are poorly reproduced when algorithms are used in real time on actual patients [56]. Given the baseline resistance to the clinical uptake of any new technology, such clinical studies are imperative to convince clinicians of the safety and efficacy of these algorithms.

Algorithm Limitations
Limitations in the study and efficacy of these MLAs can be divided into (1) those that are inherent to machine learning methods and (2) those that can be improved with proper study design. Many of the uncontrollable limitations come from the feature extraction stage. Proper feature extraction depends on high signal-to-noise ratios generated from high-resolution imaging. ADC sequences, a common MR-generated sequence used in algorithms classifying PFTs, have inherently lower scan resolution, which translates to greater noise, especially when compared to T1- and T2-weighted sequences [31]. Increased noise is difficult to control for and makes the extraction of clinically meaningful imaging characteristics more difficult. Feature extraction from imaging is also limited by the quality of the predominantly manual region of interest delineation and segmentation processes [31]. Seventeen of the included studies featured manual image segmentation with minimal quality control. Manual segmentation is time consuming and highly operator-dependent, introducing bias into any cohort [57,58]. However, automatic segmentation is not always preferred, as it is often ambiguous how the segmentation algorithm defines the region of interest. Finally, the inherent variation between scans captured by different machines with different calibration methods makes uniform analysis challenging [59]. This limitation is especially relevant to studies that spanned different centers, such as those completed by Zhang et al. [33] or Quon et al. [31], which must contend with sequences captured by different machine makes and models. MLAs rely on minor differences in characteristics between cases to make classification decisions, so these minor variations between machines have the potential to alter results.
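One simple mitigation for scanner-to-scanner variation is to normalize features within scanner groups before pooling cohorts. The sketch below applies per-scanner z-scoring to hypothetical feature values; more principled harmonization methods (e.g., ComBat) also exist, and the reviewed studies did not necessarily use this approach.

```python
import pandas as pd

# Hypothetical radiomic features measured on two different scanners.
df = pd.DataFrame({
    "scanner":     ["A", "A", "A", "B", "B", "B"],
    "adc_mean":    [1180.0, 1240.0, 1210.0, 980.0, 1030.0, 1005.0],
    "adc_entropy": [4.1, 4.4, 4.2, 3.2, 3.5, 3.3],
})

features = ["adc_mean", "adc_entropy"]
# Z-score each feature within its scanner group before pooling cohorts.
harmonized = df.groupby("scanner")[features].transform(
    lambda col: (col - col.mean()) / col.std()
)
print(harmonized)
```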
Many of the methodological limitations commonly observed in machine learning classifiers of PFTs are correctable. The most salient limitation is the small sample sizes used to train and validate these algorithms. Only eight studies reported a training set of over 100 samples, and only three reported similarly large validation sets [20,28,31,33-38]. This scale (~100 samples) is significantly smaller than that of the training sets used for deep learning models, which typically require at least ~1000 to ~10,000 samples in the supervised setting. Algorithms trained on small sample sizes commonly overfit the data, yielding an overestimated accuracy [60]. Furthermore, Bidiwala et al. [17] and Fetit et al. [21] employed a leave-one-out cross-validation method in which one sample was withheld, the model was trained on the remaining samples, and validation was completed on the single withheld sample; this process was repeated for all samples and the results were aggregated. While this is an understandable approach when dealing with small sample sizes, as these authors were, it can also lead to highly inconsistent results and is prone to overfitting [61]. Additionally, 36% of the studied algorithms varied in accuracy by tumor type. On average, these algorithms performed worst in the classification of ependymoma. This is most attributable to the relative under-representation of ependymoma samples in these unbalanced datasets, with 10 studies each featuring fewer than 20 ependymoma samples. While this is not surprising given that ependymomas represent only 8% to 15% of PFTs, improved representation of rare tumors in these cohorts would improve the overall accuracy of the generated algorithms [62]. Oversampling provides one potential methodologic solution to the rare tumor problem; however, algorithm developers must strike a balance so as not to oversample to an extent that the minority class is overgeneralized [63].
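For concreteness, the sketch below implements both techniques just discussed: leave-one-out cross-validation and naive random oversampling of the rarest class. The dataset and class balance are synthetic and illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=40, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.5, 0.35, 0.15],
                           random_state=0)

# Leave-one-out: train on n-1 samples, validate on the held-out sample, repeat.
loo_scores = cross_val_score(SVC(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean())

# Naive random oversampling: duplicate samples of the rarest class until it
# approaches parity with the others (imbalanced-learn offers richer options).
rare = np.bincount(y).argmin()
idx = np.where(y == rare)[0]
n_extra = max(len(y) // 3 - len(idx), 0)
extra = np.random.default_rng(0).choice(idx, size=n_extra, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print("class counts after oversampling:", np.bincount(y_bal))
```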
Potentially the most actionable limitations relate to data use and methods reporting. Most of the included studies (76%) were retrospective, which limits generalizability. Authors frequently used incomplete radiographic or molecular data as inputs. While maintaining a low bar for data inclusion increases sample sizes and generalizability, accuracy would be improved if only complete cases were included. Nine papers additionally lacked sufficient detail in their methods to determine the training or validation set size, number of clinical sites involved, or method of feature extraction [22,23,26,27,32,35,38-40]. Machine learning approaches are commonly critiqued as being "black boxes" to their users [64]. Ambiguous definition of methods and inconsistent reporting of performance metrics further reinforce this criticism and will continue to impede progress if changes are not made.

Posterior Fossa Algorithm Recommendations of Best Practice
We make the following suggestions of best practices for the development of PFT classification algorithms based on our analysis of the algorithm performance and limitations. From a procedural standpoint, most algorithms followed the commonly accepted framework of image acquisition, normalization, feature extraction, dimensionality reduction through feature selection, and classification [65]. Preprocessing and filtering prior to extraction increase the resolution of the extracted imaging features, and the subsequent dimensionality reduction removes noise and random error, increasing accuracy [34,65-67]. The majority of the classification algorithms identified in this review applied such techniques, which partially explains the high accuracies reported across many algorithms. This process should continue to be employed. While radiomics-based MLAs have classically applied a single classifier on one set of inputs, Zhang et al. [33] highlighted the value of ensemble classifiers that can identify the combination of models with the highest efficacy. As an individual algorithm's efficacy varied by tumor type, it is necessary to trial multiple combinations of classifiers to identify the ideal system for the specific problem [17,19-23,26-30,35,37,68].
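A hedged sketch of this framework as a scikit-learn pipeline, ending in a soft-voting ensemble in the spirit of the ensemble approach highlighted above; all sizes, feature counts, and classifier choices are illustrative assumptions rather than any study's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=300, n_informative=15,
                           n_classes=3, random_state=0)

pipeline = Pipeline([
    ("normalize", StandardScaler()),           # feature normalization
    ("select", SelectKBest(f_classif, k=20)),  # dimensionality reduction
    ("ensemble", VotingClassifier(             # combine several classifiers
        estimators=[("svm", SVC(probability=True)),
                    ("rf", RandomForestClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="soft")),
])
print("CV accuracy:", cross_val_score(pipeline, X, y, cv=5).mean())
```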
Additionally, algorithms should be trialed prospectively in the clinic with large training and validation sets that equally represent all included tumor types. This recommendation holds especially true for ependymoma, which, while rare, was consistently under-represented in the PFT cohorts being analyzed [16-18,21,27,30]. While no one-size-fits-all cohort size can be recommended, a minimum sample size should be chosen to ensure that results are adequately powered. In situations where available data are limited, such as with ependymoma, other machine learning methods can be employed, including model pretraining, semi-supervised learning, or self-supervised learning [69,70]. Standardized results reporting is also necessary to facilitate algorithm comparison and assessment. Each study should report, at minimum, AUC, accuracy, sensitivity, and specificity on both an aggregated as well as a per-tumor basis. One approach to ensure standardized performance analysis involves the curation of a benchmark dataset on which different models can be compared fairly and reproducibly, as has already been implemented with other radiographic data [71,72]. Such a dataset should be derived from multiple centers and contain representative and balanced data, with clear training, validation, and testing subsets. Finally, to address concerns about poor transparency in algorithm development and function, the following steps can be taken: (1) local features can be aggregated to give a sense of the overall model, (2) methods such as the "predictive, descriptive, relevant" framework described by Murdoch et al. [73] or the NTRPRT guideline developed by Chen et al. [74] can be utilized to ensure that algorithms are maximally interpretable, and (3) uncertainty measures can be included in model predictions to flag when the model is prone to misclassifications and highlight when human intervention may be required [75,76].
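To make the reporting recommendation concrete, the sketch below computes per-tumor sensitivity, specificity, and one-vs-rest AUC, and flags low-confidence predictions as a simple uncertainty measure. The class names and the 0.6 confidence threshold are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)
pred = model.predict(X_te)

# Per-tumor (one-vs-rest) sensitivity, specificity, and AUC.
for c, name in enumerate(["ependymoma", "medulloblastoma",
                          "pilocytic astrocytoma"]):
    tn, fp, fn, tp = confusion_matrix(y_te == c, pred == c).ravel()
    auc = roc_auc_score(y_te == c, proba[:, c])
    print(f"{name}: sensitivity={tp / (tp + fn):.2f} "
          f"specificity={tn / (tn + fp):.2f} AUC={auc:.2f}")

# Simple uncertainty flag: refer low-confidence cases for human review.
print("flagged for review:", int((proba.max(axis=1) < 0.6).sum()), "cases")
```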
Algorithms developed from MR imaging, microscopy slides, and molecular data were all similarly efficacious [17,21,38-40]. While algorithms that improve the cost or speed of tissue diagnosis still have clinical value, algorithms developed from imaging data should be prioritized, as computed tomography (CT) and MR are significantly less invasive than tissue collection.

Limitations & Future Directions
This systematic review has some limitations. Papers were only sampled through 31 July 2022, so any additional algorithms classifying PFTs published since have not been included. While potentially clinically useful, this analysis excluded algorithms classifying PFTs by molecular subtypes or prognosis to facilitate the easy comparison of identified algorithms. Algorithm critiques were based solely on the published description of the algorithm at the time this paper was written. Additional data or documentation covering algorithm operation or performance published elsewhere may not be included in this analysis.
The algorithms reported in this paper offer many different approaches to the classification and diagnosis of PFTs based on imaging or molecular features. While some of these methods are compared to a clinical standard, such as a neuroradiologist, many are not. Additional work is needed to make these comparisons to the standard of care and, more importantly, to study the efficacy of these algorithms in the clinical environment. It is postulated that the true clinical integration of machine learning will manifest as a symbiosis between the physician and the developed algorithms instead of the algorithm replacing the physician [77]. Thus, further work is also needed to investigate how physicians interact with these algorithms and how neuroradiologists or neuropathologists can apply these methods to further improve diagnostic accuracy. As previously discussed, a significant barrier to the clinical implementation of machine learning classification algorithms is the methodologic limitations in algorithm design and testing. While proposed solutions are resource-intensive, they seek to make this complex technology more digestible to the typical physician, who is not well-versed in machine learning methods. Multi-institutional collaborations in the field could allow for resource pooling, access to larger sample sizes, and increased exposure of MLAs to industry stakeholders.

Conclusions
Overall, machine learning has the potential to improve diagnostic speed and accuracy for pediatric PFTs. Developed algorithms focused on the classification of medulloblastoma, pilocytic astrocytoma, and ependymoma, with inconsistent results: some algorithms reported exceptional performance metrics while others yielded suboptimal outcomes. While a minority of algorithms consistently outperformed the current clinical standard of care, most were nonsuperior or lacked such a comparison. Common limitations included poor methods reporting, small sample sizes, under-representation of certain tumor types such as ependymoma, and methodological limitations inherent to the development of MLAs. The advancement of these algorithms to clinical use will necessitate adherence to consistent data reporting standards, training and validation on larger samples, prospective trials in real-time clinical workflows, and the study of algorithms as an adjunct to the current standard of care rather than as a replacement.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14225608/s1, Table S1: Summary of performance metrics for machine learning algorithms that discriminate between both common and rare posterior fossa tumors.