Automated Identification of Multiple Findings on Brain MRI for Improving Scan Acquisition and Interpretation Workflows: A Systematic Review

We conducted a systematic review of the current status of machine learning (ML) algorithms’ ability to identify multiple brain diseases, and we evaluated their applicability for improving existing scan acquisition and interpretation workflows. PubMed Medline, Ovid Embase, Scopus, Web of Science, and IEEE Xplore literature databases were searched for relevant studies published between January 2017 and February 2022. The quality of the included studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 tool. The applicability of ML algorithms for successful workflow improvement was qualitatively assessed based on the satisfaction of three clinical requirements. A total of 19 studies were included for qualitative synthesis. The included studies performed classification tasks (n = 12) and segmentation tasks (n = 7). For classification algorithms, the area under the receiver operating characteristic curve (AUC) ranged from 0.765 to 0.997, while accuracy, sensitivity, and specificity ranged from 80% to 100%, 72% to 100%, and 65% to 100%, respectively. For segmentation algorithms, the Dice coefficient ranged from 0.300 to 0.912. No studies satisfied all clinical requirements for successful workflow improvements due to key limitations pertaining to study design, study data, reference standards, and performance reporting. Standardized reporting guidelines tailored for ML in radiology, prospective study designs, and multi-site testing could help alleviate this.


Introduction
Brain magnetic resonance imaging (MRI) is recognized as the imaging modality that produces the best images of brain tissues, body fluids, and fat [1]. It remains the most appropriate modality for diagnosing patients with symptoms of multiple brain diseases, including inflammatory diseases, dementia, neurodegenerative disease, cerebrovascular disease, and brain tumors [2][3][4][5]; hence, it plays an important role in multiple clinical scenarios ranging from acute diagnostics to routine follow-ups. A brain MRI scan typically consists of several scan sequences, the most commonly included being T1-weighted (T1) and T2-weighted (T2) sequences, a diffusion-weighted imaging (DWI) sequence, a fluid-attenuated inversion-recovery (FLAIR) sequence, and a bleeding-sensitive sequence, e.g., T2* gradient-recall-echo (T2*-GRE) [6]. Selecting the appropriate sequences a priori can be challenging, because many brain diseases often [...]

Search Strategy
[...] Search terms for hemorrhage subtypes, e.g., "subarachnoid hemorrhages" and "subdural hemorrhages", were also included in the search string. The full search string can be found in Appendix A.

Study Selection
Records that developed ML algorithms for the automated identification of normal and abnormal findings on brain MRI were screened. Main inclusion and exclusion criteria are listed in Table 1.

Table 1. Inclusion and exclusion criteria.

Inclusion Criteria:
- Studies focusing on abnormal brain diseases that included either brain infarct, hemorrhage, or tumor on brain MRI
- Studies developing algorithms tested on a dataset that was separate from the training dataset
- Peer-reviewed studies in English

Exclusion Criteria:
- Studies focusing on tasks not relevant for the identification of brain diseases
- Studies focusing on identification of a single brain disease only
- Studies focusing on development of ML for specialized MR sequences (e.g., MR elastography, functional MRI) or other imaging modalities (e.g., SPECT, PET, CT, US)
- Studies with primarily non-adult populations
- Editorials, case series, letters, conference proceedings, reviews, and inaccessible papers

Two medical doctors (K.S. and C.M.O.) served as reviewers. They independently screened all records based on title and abstract. This was followed by the extraction of relevant reports for full-text screening and final study inclusion. Covidence (Melbourne, Australia) was used for record and report screening. Discussions between both reviewers were held to resolve any conflicts; if a consensus was not reached, a third reviewer (J.F.C.) was consulted.

Data Extraction and Analysis
The reviewers independently extracted data from the included studies according to a pre-defined datasheet. Study and algorithm characteristics were extracted, comprising the following: (a) study information, (b) population/dataset characteristics, including the number of patients or images, pathology in the dataset, and MR sequences available, (c) aim of the algorithms, (d) type of algorithm, and (e) training and testing strategies, including how data splits were performed. Reported performance metrics together with confidence intervals were also extracted, including accuracy, sensitivity, specificity, F1-score, negative predictive value (NPV), positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC). Furthermore, the Dice score coefficient (DSC), one of the most common evaluation metrics used in medical segmentation tasks [21], was extracted where applicable in brain segmentation studies. Performance numbers were summarized using descriptive statistics. If multiple results were reported for different variations of the same algorithm, only the best performance result was extracted unless otherwise stated. When available, performance results were extracted from external test datasets. Included studies were grouped by the tasks of their algorithms. The analysis of data was primarily conducted using pivot tables and the built-in analysis tools of Microsoft Excel.
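For reference, the classification metrics above derive from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), while the DSC compares a predicted segmentation mask A with its reference delineation B. These are the standard definitions, restated here for convenience:

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP},\\[4pt]
\text{PPV} &= \frac{TP}{TP + FP}, \qquad\;\;
\text{NPV} = \frac{TN}{TN + FN},\\[4pt]
\text{DSC} &= \frac{2\,\lvert A \cap B \rvert}{\lvert A \rvert + \lvert B \rvert}
\end{aligned}
```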

Quality Assessment of Included Studies
The two reviewers independently assessed the quality of the included studies by using the tailored questionnaire Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) [22] with signaling questions covering risk of bias and concern for applicability in the domains of patient selection, index test, reference test, and flow and timing. For each study, the respective domains were graded as high-, unclear-, or low risk of bias/concern for applicability. Discordance between the reviewers was resolved through discussion.

Evaluation of Applicability for Workflow Improvements
The applicability of each included ML algorithm for improving scan acquisition and interpretation workflows was qualitatively assessed based on three essential requirements previously mentioned in the Introduction (Section 1): (A) reflection of clinical practice, (B) testing on an external out-of-distribution dataset, and (C) acceptable performance results. Each requirement was indicated as being 'Satisfied' (S) or 'Not Satisfied' (NS). The first requirement (A) was satisfied if the patient population was consecutively sampled, if the disease distribution was well reported, and if the study was assessed as having low risk of bias/concern for applicability for the review question. The second requirement (B) was satisfied if external test datasets with data from a different time-period and geographical location were used to produce performance results. Lastly, the third requirement (C) was graded as satisfied if a majority of the abovementioned result metrics exceeded a predefined threshold of 85% of the maximum attainable value. This threshold for acceptable performance was selected because it reflects the performance level of a neuroradiologist performing similar disease identification tasks [23].
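To make the grading of requirement (C) concrete, a minimal sketch is shown below, assuming each metric is compared against 85% of its own maximum attainable value (1.0 for AUC and DSC, 100% for percentage metrics); the metric values and function name are hypothetical, not taken from any included study:

```python
# Hypothetical grading of requirement (C): a majority of reported
# metrics must exceed 85% of their maximum attainable value.
def satisfies_requirement_c(metrics: dict[str, tuple[float, float]]) -> bool:
    """metrics maps name -> (reported value, maximum attainable value)."""
    passed = sum(value > 0.85 * maximum for value, maximum in metrics.values())
    return passed > len(metrics) / 2

# Example with made-up performance numbers:
reported = {
    "AUC": (0.93, 1.0),            # maximum attainable AUC is 1.0
    "sensitivity": (88.0, 100.0),  # percentage metrics max out at 100
    "specificity": (81.0, 100.0),
}
print(satisfies_requirement_c(reported))  # True: 2 of 3 metrics exceed 85%
```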

Study Selection and Data Extraction
The search of electronic databases returned 5688 records. The removal of duplicates resulted in 3542 records. Screening record titles and abstracts resulted in 81 reports selected for full-text eligibility assessments, of which 19 studies were included for qualitative review. The study inclusion process is illustrated in Figure 1.
Details about study characteristics are summarized in Table 2. All 19 included studies were of retrospective design. Study populations varied with regard to source and size. Twelve out of nineteen studies (63%) used public datasets for the development and testing of their algorithms. These datasets included The Whole Brain Atlas from Harvard Medical School (HMS) [24], The Cancer Imaging Archive (TCIA) [25], the Brain Tumor Segmentation (BRATS) challenge dataset [26], and the Ischemic Stroke Lesion Segmentation (ISLES) challenge set [27]. Most study populations ranged between 100 and 500 patients, with one study (5%) including fewer than 100 patients and five studies (26%) including more than 1000 patients. All large study populations were private, i.e., part of a local in-house dataset not publicly available to researchers outside of the research institution in question. Six studies (31%) reported data size only as a number of 2D images, ranging from 200 to 4600 images. Training, validation, and testing of algorithms were on average performed using 69%, 3%, and 28% of all available data, respectively. Validation was performed in only six (31%) studies. Testing was mostly performed on data split out from the same data source; however, an external dataset with data from a different time-period and geographical location was used in three (15%) studies.

All studies developed algorithms focusing on brain disease identification through either classification or segmentation tasks. Seven studies (37%) focused on binary classification of images into normal/abnormal or differentiation between two diseases, five studies (26%) on multiclass classification of images into specific disease categories, and seven studies (37%) on multiclass segmentation of specific diseases. Most algorithmic tasks employed deep discriminative models, with 14 studies (74%) using convolutional neural networks (CNNs). Three studies (16%) employed deep generative models, with two (11%) using variational autoencoders (VAEs) and one (5%) using a generative adversarial network (GAN). Reference tests were mostly labels and delineations made by neuroradiologists. Exceptions were found in the study by Wood et al. [46], where reference labels were generated using natural language processing (NLP) of radiological reports, and in the study by Ahmadi et al. [28], where reference delineations were constructed using principal component analysis (PCA). Details about the test setup and performance metrics are summarized in Table 3 for binary and multiclass classification algorithms and in Table 4 for segmentation algorithms.
Different performance measures were reported for each study. For classification studies, the most frequently reported performance metrics were AUC, accuracy, sensitivity, and specificity. AUC ranged from 0.765 to 0.997, while accuracy, sensitivity, and specificity ranged from 80% to 100%, 72% to 100%, and 65% to 100%, respectively. Positive and negative predictive values were reported in nine studies (47%) and ranged from 12% to 94% and 48% to 99%, respectively. The higher performance values were predominantly observed in binary classification studies with smaller study populations, while the lower values were seen when identifying brain tumors. For segmentation studies, the Dice score coefficient was the most commonly reported measure, ranging from 0.300 for infarct segmentations to 0.912 for glioma and multiple-sclerosis segmentations. Sensitivity and specificity ranged from 13% to 99.9% and 87% to 99.8%, respectively, with the lower sensitivity values attributed to brain infarct segmentations.

Applicability to Workflow Improvement
The applicability of the included ML algorithms for improvements in scan acquisition and interpretation workflows was evaluated based on the satisfaction of the three requirements of (A) testing environments reflecting clinical practice, (B) testing on external out-of-distribution datasets, and (C) acceptable algorithm performance results; see Section 2. Evaluation results for each requirement are summarized in Tables 3 and 4 as well. Ten (53%) of nineteen studies were assessed as having acceptable performance; however, only one (5%) satisfied the requirement of being tested in an environment reflecting clinical practice, and three (15%) satisfied the requirement of testing on an external out-of-distribution dataset. Three studies (15%) satisfied two main requirements; these studies used privately acquired datasets. No studies satisfied all three main requirements for successful workflow integrations.

Quality Assessment
The Quality Assessment of Diagnostic Accuracy Studies 2 tool was applied to all included studies in this review. The results of the risk of bias/concern for applicability analysis are presented in Table 5 and summarized in Figure 2.

[Table 5. Presentation of risk of bias/concern for applicability analysis results. For each included study (source), risk of bias and concern for applicability are graded as low, high, or unclear in the domains of patient selection, index test, reference test, and flow and timing.]

[Figure 2. Summary of risk of bias and concern for the applicability of included studies.]
Significant risks of bias and concern for applicability were seen in the domains of patient selection, index test, and reference test. Reasons for these include a lack of consecutive patient populations in eighteen studies, arbitrary classification of equivocal diseases in two studies, large threshold values and the exclusion of smaller lesions in two studies, and automatically generated reference labels in two studies. Only two studies were assessed with low or unclear risk of bias and concern for applicability in all domains.
No meta-analysis was conducted due to inherent heterogeneities in study tasks, population characteristics, and performance metrics.

Discussion
In this systematic review, we found that the included algorithms varied considerably in terms of tasks, data requirements, and applicability to workflow improvement. A significant risk of bias was seen with respect to patient selection, index tests, and reference tests. Most (63%) surveyed algorithms were developed using public datasets derived from ML development challenges. This largely explains the following observed patterns: data size restricted to a few hundred patients, specific disease distributions in multiple datasets, and algorithm inference capabilities based on a limited number of MR scan sequences. T2 and T2-FLAIR were the most frequently used sequences for the identification of multiple brain diseases. However, this observation might be confounded by the usage of public datasets. Deep neural networks and derivatives thereof were the most frequently applied ML algorithms, which might be due to their proven high performance and robust feature input methods [47]. All studies published in clinical journals used private datasets with larger patient populations. All studies that satisfied more than one workflow applicability requirement likewise used private datasets. This observed pattern of private dataset usage fits the general trend, where promising ML algorithms are validated and granted regulatory approval for clinical usage based on retrospective, unpublished, and often proprietary data from a single institution [48].
Performance results varied considerably as well. About half of the algorithms exceeded the pre-defined threshold of 85% of the maximum attainable value on their respective performance metrics. Disease segmentation performance was generally lower due to the complexity of this task. These results are corroborated by similar reviews performed by Zhang et al. [49] and van Kempen et al. [50] focusing on ischemic stroke and glioma segmentation, respectively. Similar performance levels in relation to triaging were also observed across other imaging modalities. Hickman et al., for instance, demonstrated a pooled AUC, sensitivity, and specificity of 0.89, 75.4%, and 90.6%, respectively, for screening and triaging mammography using machine learning techniques [51], which is in line with what is observed in this review. Hence, consistent performance results are reported across multiple imaging modalities when similar methods are used [19,52]. However, large performance gaps were seen across clinical settings and study designs, partially owing to the well-documented effect of domain shift [53]. For example, Gauriau et al. [33] tested an algorithm with moderately low sensitivity and specificity of 77% and 65%, respectively. These results were, however, attained on a large out-of-distribution dataset with a comprehensive representation of almost all diseases seen in everyday clinical practice. On the other hand, the algorithm developed by Lu, Lu and Zhang [40] achieved a binary classification accuracy, sensitivity, and specificity of 100%, but this was achieved on a very small subset of 87 2D MR slices split out from the same data source as the training data and not reflecting clinical practice. These findings support the approach of considering multiple requirements for study design, study population, testing strategies, and performance when assessing the benefits and limitations of integrating ML algorithms into existing workflows.

Potential Benefits of Integrating ML into Existing Scan- and Interpretation Workflows
ML algorithms for clinical workflow integrations have been studied extensively in the past years with multiple authors suggesting different applications [11,12,52,54,55]. Olthof et al. suggest that radiologist workflows could be supported, extended, or replaced by ML functionalities [56].
Based on the findings in this review, scan acquisition workflows could be supported by multiclass classification and segmentation algorithms. Using only a few scan sequences acquired at the beginning of the acquisition process, these algorithms could help classify initial scan images into different disease categories while the patient is still in the scanner and subsequently direct further scan acquisition based on real-time findings, as sketched below. This could prevent the excessive scanning of patients with no significant findings while ensuring fast scan acquisition for stroke patients and appropriate scan acquisition for tumor patients. The fact that 42% of the ML algorithms included in this review could successfully perform multiclass classification and segmentation based on a single MR sequence supports the feasibility of this concept.
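A minimal sketch of how such real-time triage could sit in an acquisition pipeline is given below; the function names, protocol table, and disease labels are our own illustrative assumptions, not an interface published by any included study:

```python
# Illustrative sketch only: route the remaining MR protocol based on a
# multiclass classifier applied to the first acquired sequence(s).
from typing import Callable
import numpy as np

# Hypothetical follow-up protocols per predicted finding.
FOLLOW_UP_PROTOCOLS = {
    "no significant finding": [],                      # stop scanning early
    "infarct": ["DWI", "FLAIR", "T2*-GRE"],            # fast stroke workup
    "tumor": ["T1 pre-contrast", "T1 post-contrast"],  # extended tumor workup
}
FULL_PROTOCOL = ["T1", "T2", "DWI", "FLAIR", "T2*-GRE"]

def direct_acquisition(initial_volume: np.ndarray,
                       classify: Callable[[np.ndarray], str]) -> list[str]:
    """Return the follow-up sequences suggested by the classifier's prediction."""
    finding = classify(initial_volume)  # e.g., a CNN trained on initial T2 images
    # Unknown or unexpected findings fall back to the full standard protocol.
    return FOLLOW_UP_PROTOCOLS.get(finding, FULL_PROTOCOL)
```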
Scan interpretation workflows, on the other hand, could be supported by all algorithms in this review. In fact, some of the surveyed binary classification studies aimed explicitly to support interpretation workflows by performing worklist prioritization of critical findings [33,41,45] and, hence, ensuring faster reporting times and improved patient outcomes; a sketch of this mechanism follows below. Multiclass classification and segmentation algorithms could extend this further by offering automated diagnosis reporting, biomarker quantification, and even disease progression predictions. None of the surveyed algorithms, however, satisfied all requirements for successful workflow improvements due to key limitations.
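The prioritization step itself reduces to re-ordering the reading worklist by each scan's predicted probability of containing a critical finding. A minimal sketch under that assumption (study IDs and probabilities are made up):

```python
import heapq

def prioritize_worklist(studies: list[tuple[str, float]]) -> list[str]:
    """studies: (study_id, p_critical) pairs from a binary classifier.
    Returns study IDs ordered so likely critical findings are read first."""
    # heapq implements a min-heap, so probabilities are negated to pop
    # the highest-probability study first.
    heap = [(-p, study_id) for study_id, p in studies]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Example with made-up classifier outputs:
print(prioritize_worklist([("scan-1", 0.02), ("scan-2", 0.91), ("scan-3", 0.40)]))
# -> ['scan-2', 'scan-3', 'scan-1']
```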

Limitations of Included Studies and Future Directions
Important limitations pertaining to study design, data sources, model development, and testing methodologies were uncovered through the risk of bias/concern for applicability analysis and the assessment of applicability for workflow improvements. First, patient selections were not consecutive and were largely based on public datasets consisting of imaging cases with high signal-to-noise ratios and selection biases. This is especially true for the BRATS challenge dataset, which is known to contain handpicked and well-processed representations of brain gliomas that are very characteristic and visually recognizable, thus resulting in many algorithms achieving good performance when developed and tested on it [57]. This could introduce an overestimation of model performance and limit integration into clinical practices that face more heterogeneous images of brain diseases. Secondly, index tests were limited by insufficient reporting of model thresholds or by deliberately large thresholds chosen for favorable performance reporting. Nael et al. [41], for instance, demonstrated that their model performance dropped significantly when detecting smaller infarction volumes of <0.25 mL compared to volumes of 1 mL. Because the accurate delineation of the size, location, and development of ischemic lesions has great prognostic implications [58], this trend of size-dependent accuracy could pose challenges to performing accurate recovery predictions and, hence, overall stroke management. Thirdly, reference tests similarly introduced critical biases, especially in the included studies that used 2D-image datasets with handpicked 2D images and labels as ground truth. This selection of representative images could have introduced priors that are easily exploitable by ML algorithms, as has previously been demonstrated in similar datasets [59]. Fourthly, about half of the surveyed ML algorithms had unacceptably low sensitivity and specificity, which could increase scan acquisition workloads and, more worryingly, decrease patient safety. Finally, only a minor proportion reported the clinically relevant metrics of positive and negative predictive values. This, combined with the lack of testing on out-of-distribution datasets, might have presented skewed performance impressions that do not account for all relevant conditions in the intended target population [13].
Future studies developing ML algorithms applicable for workflow improvements should ensure a consecutively sampled patient population that reflects the intended target population, transparent reporting of patient population characteristics and index test thresholds, and performance reporting through metrics that incorporate different aspects of positive and negative findings. Low false-negative rates should be prioritized, ensuring adequate patient safety by having the fewest possible missed findings. Disease prevalence must be considered to account for positive and negative predictive values, as illustrated below. To alleviate some of these limitations, standardized reporting guidelines tailored for AI in radiology [60], prospective study designs with consecutive patient sampling, and multi-site testing with clinical partners must be considered. The challenges of low sensitivity and specificity might be addressed by rethinking existing data acquisition strategies and model architectures. For instance, temporal information from follow-up scans or contrast-enhancement kinetics could be taken into account. Similar strategies are being used on PET-CT scans, resulting in improved tumor classification specificity [61].
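The prevalence dependence follows directly from Bayes' theorem. As a worked example with assumed values, a classifier with 90% sensitivity (Se) and 90% specificity (Sp) applied at a 5% disease prevalence p attains a PPV of only about one in three:

```latex
\mathrm{PPV} = \frac{\mathrm{Se}\, p}{\mathrm{Se}\, p + (1-\mathrm{Sp})(1-p)}
             = \frac{0.90 \times 0.05}{0.90 \times 0.05 + 0.10 \times 0.95}
             \approx 0.32
```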

Limitations of This Review
This review should be read in view of limitations, including publication and reporting bias. We limited our inclusion criteria to studies that could identify multiple brain diseases, including brain infarcts, hemorrhages, or tumors, and we further restricted inclusion to studies that tested their algorithms on data separate from the training data. Next, we assessed the applicability of ML algorithms for improving workflows based on a set of requirements not previously validated. All of this might have limited the completeness of the overview of this research field. As these criteria were selected based on clinical relevance, the results nonetheless present clinically useful representations of how state-of-the-art ML algorithms could be applied to improve existing scan acquisition and interpretation workflows.

Conclusions
The surveyed algorithms could potentially support and extend existing workflows. However, limitations pertaining to study design, study data, reference standards, and performance reporting prevent clinical integration. No study satisfied all requirements for successful workflow integration. Standardized reporting guidelines tailored for ML in radiology, prospective study designs, and multi-site testing could help alleviate this. The findings from this review could aid future researchers and healthcare providers by allowing them to critically assess relevant ML studies for workflow improvements and by enabling them to better design studies that validate the benefits of deploying ML in scan acquisition and interpretation workflows.

Appendix A

EMBASE
exp magnetic resonance imaging/
exp brain disease/
exp machine learning/
exp classification/ or detection.mp. [mp = title, abstract, heading word, drug trade name, original title, device manufacturer, drug manufacturer, device trade name, keyword heading word, floating subheading word, candidate term word]
1 and 2 and 3 and 4 and 5

Scopus
(TITLE-ABS-KEY ("magnetic resonance imaging" OR "Multiparametric Magnetic Resonance Imaging" OR "MRI")) AND (TITLE-ABS-KEY ("Artificial Intelligence" OR "Machine Learning" OR "Deep Learning" OR "Neural Network" OR "Convolutional neural network")) AND (TITLE-ABS-KEY (brain AND disease OR brain AND infarct OR brain AND hemorrhage OR brain AND neoplasm* OR brain AND tumor OR brain AND anomal* OR brain AND abnormal* OR brain AND patholog* OR brain "multi-class" OR brain "critical finding*" OR brain AND triag* OR brain AND automat*)) AND (TITLE-ABS-KEY (classification OR detection))

Web of Science
ALL = (("magnetic resonance imaging" OR "Multiparametric Magnetic Resonance Imaging" OR "MRI") AND ("Artificial Intelligence" OR "Machine Learning" OR "Deep Learning" OR "Neural Network" OR "Convolutional neural network") AND (brain disease OR brain anomal* OR brain abnormal* OR brain patholog* OR brain "multi-class" OR brain "critical finding*" OR brain triag* OR brain automat* OR brain infarct OR brain hemorrhage OR "intraparenchymal hemorrhage" OR brain neoplasm OR brain tumor) AND (classification OR detection))

IEEE Xplore
("magnetic resonance imaging" OR "Multiparametric Magnetic Resonance Imaging" OR "MRI") AND ("Artificial Intelligence" OR "Machine Learning" OR "Deep Learning" OR "Neural Network" OR "Convolutional neural network") AND (brain disease OR brain anomaly OR brain abnormality OR brain pathology OR brain "multi-class" OR brain "critical finding*" OR brain triage OR brain infarct OR brain hemorrhage OR brain neoplasm OR brain tumor) AND (classification OR detection)