Deep Learning for Detecting Brain Metastases on MRI: A Systematic Review and Meta-Analysis

Simple Summary: Manual detection and delineation of brain metastases are time-consuming and variable. Studies have therefore been conducted to automate this process using imaging and artificial intelligence. To the best of our knowledge, no systematic review and meta-analysis has examined brain metastasis detection using only deep learning and MRI. A systematic review of this topic is therefore required, together with an assessment of study quality and a meta-analysis to determine the strength of the current evidence. The purpose of this study was to perform a systematic review and meta-analysis of the performance of deep learning models that use MRI to detect brain metastases in cancer patients.

Abstract: Since manual detection of brain metastases (BMs) is time-consuming, studies have been conducted to automate this process using deep learning. The purpose of this study was to conduct a systematic review and meta-analysis of the performance of deep learning models that use magnetic resonance imaging (MRI) to detect BMs in cancer patients. A systematic search of MEDLINE, EMBASE, and Web of Science was conducted up to 30 September 2022. Inclusion criteria were: patients with BMs; deep learning using MRI images applied to detect the BMs; sufficient data on detection performance; original research articles. Exclusion criteria were: reviews, letters, guidelines, editorials, or errata; case reports or series with fewer than 20 patients; studies with overlapping cohorts; insufficient data on detection performance; machine learning rather than deep learning used to detect BMs; articles not written in English. The Quality Assessment of Diagnostic Accuracy Studies-2 and the Checklist for Artificial Intelligence in Medical Imaging were used to assess quality. Finally, 24 eligible studies were identified for the quantitative analysis. The pooled proportion of patient-wise and lesion-wise detectability was 89%.
Deep learning algorithms effectively detect BMs. A pooled analysis of false positive rates could not be performed due to differences in reporting. Articles should adhere to the checklists more strictly.


Introduction
Brain metastases (BMs) are observed in nearly 20% of adult cancer patients [1]. They are the most common type of intracranial neoplasm in adults [2]. Although the brain parenchyma is the most common intracranial metastatic site, BMs frequently occur in conjunction with metastases to other sites such as the cranium, dura, or leptomeninges [3]. Therefore, making an accurate diagnosis of BM is critical.

Study Selection
All search results were exported to the Rayyan online platform [16]. After removing duplicates, two authors (B.B.O. and M.K.) independently screened titles and abstracts using the Rayyan platform and reviewed the full text of potentially relevant articles. A senior author (M.W., 25 years of experience in neuroradiology) compared and examined the results of each search and analysis step. Any disagreement was resolved via discussion with the senior author (M.W.).
Articles were included if they satisfied all of the following criteria: (I) inclusion of patients with BMs; (II) deep learning using MRI images was applied to detect the BMs; (III) sufficient data on the detection performance of the deep learning algorithms were present; (IV) original research articles.
Articles were excluded if they fulfilled any of the following criteria: (I) reviews, letters, guidelines, editorials, or errata; (II) case reports or series with fewer than 20 patients; (III) studies with overlapping cohorts; (IV) insufficient data on the detection performance of the deep learning algorithms; (V) machine learning, not deep learning, was used to detect BMs; (VI) articles not written in English. In the case of overlapping cohorts, the article with the largest sample size was included. If articles had the same number of patients, the most recent one was preferred.

Data Extraction
The following variables were collected by the two authors (B.B.O. and M.K.): (I) study characteristics (first author, publication year, study design, number of patients in each dataset, sex of patients in each dataset, number of metastatic lesions in each dataset, mean or median size of lesions, reference standard for metastasis detection, validation method, and primary tumors); (II) deep learning details and statistics (detectability, false positive rate, deep learning algorithm, and data augmentation); (III) MRI and scanner characteristics; and (IV) inclusion and exclusion criteria of each study.

Quality Assessment
Two authors (B.B.O. and M.K.) independently performed the quality assessments using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [17,18]. The CLAIM is a new 42-item checklist for evaluating artificial intelligence studies in medical imaging. For each item, studies were given a score of 0 or 1 on a 2-point scale. The CLAIM score was calculated by adding the scores from each study. The items were all equally weighted. QUADAS-2 was used to assess four domains: (I) patient selection, (II) index test, (III) reference standard, and (IV) flow and timing. During the quality assessment, any disagreements were resolved with the assistance of the senior author.
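As a minimal sketch of the scoring scheme just described (item values are hypothetical), the per-study CLAIM total is simply the sum of 42 equally weighted binary items:

```python
# CLAIM scoring sketch: each of the 42 checklist items is scored
# 0 (not satisfied) or 1 (satisfied); the study total is their sum,
# with all items weighted equally.
def claim_score(items):
    if len(items) != 42:
        raise ValueError("CLAIM has 42 items")
    if any(v not in (0, 1) for v in items):
        raise ValueError("each item is scored 0 or 1")
    return sum(items)

# Hypothetical study satisfying 30 of the 42 items
print(claim_score([1] * 30 + [0] * 12))  # 30
```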

Meta-Analysis
The primary goal was to assess the detectability (sensitivity) of deep learning algorithms in detecting BMs. For the pooled proportion analysis of detectability across all included studies, the reported sensitivity per study was multiplied by the total number of patients in each study to estimate the number of detected events; in other words, reported sensitivity was converted to patient-wise sensitivity. Another pooled proportion analysis of detectability was performed for studies reporting their sensitivity lesion-wise. The detectability of each study in this group was calculated by dividing the number of correctly identified metastases in the test set (using given true positive values or calculated with reported sensitivities) by the total number of metastases in the test set. The following statistical analyses were performed in both pooled analyses.
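The conversion described above can be sketched as follows (study numbers are hypothetical): the estimated number of detected events is the reported sensitivity multiplied by the relevant denominator, patients for the patient-wise analysis or test-set metastases for the lesion-wise analysis.

```python
# Recover an approximate event count from a reported sensitivity so a
# study can enter a pooled proportion analysis (events / total).
def events_from_sensitivity(sensitivity, total):
    return round(sensitivity * total)

# Patient-wise: reported sensitivity 0.92 in a hypothetical 50-patient test set
print(events_from_sensitivity(0.92, 50))   # 46
# Lesion-wise: sensitivity 0.85 over 200 metastases in the test set
print(events_from_sensitivity(0.85, 200))  # 170
```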
The pooled proportion analysis of detectability estimates with 95% CIs was performed with a random-effects model. The random-effects model, using the inverse variance method, was chosen since it captures uncertainty due to heterogeneity among studies [19,20]. The Freeman-Tukey double arcsine transformation was applied to stabilize the variances before pooling. Forest plots were created to provide a visual representation of the results. Meta-regression analyses were performed to determine whether training size was associated with the sensitivity. Heterogeneity among all included studies was evaluated using the Q-test (p < 0.05 suggesting the presence of study heterogeneity) and the I² statistic. I² values were interpreted as follows: heterogeneity that might not be important (0-25%), low heterogeneity (26-50%), moderate heterogeneity (51-75%), and high heterogeneity (76-100%) [21]. Since heterogeneity might indicate subgroup effects, we also explored heterogeneity in the pooled results using subgroup analysis [22]. Subgroup analyses were conducted in terms of the validation method used in the study (split training-test sets versus cross-validation), the plurality of MRI sequences utilized (single sequence versus multi-sequence), the calculation method of the reported sensitivity rates (lesion-wise versus other [patient-wise, voxel-wise]), the plurality of the primary tumor type (single versus multiple), the dimension of the included images (2D versus 3D versus both), study design (single-center versus multi-center), and deep learning algorithms (DeepMedic versus U-Net versus others) (Figure 1). Meta-regression analyses were also performed to determine whether training size was associated with heterogeneity. Publication bias occurs when the findings of a study influence the study's likelihood of publication. The Egger method was applied to test funnel plot asymmetry for publication bias [23].
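The pipeline above (Freeman-Tukey double arcsine stabilization, inverse-variance random-effects pooling, and I²) can be sketched in Python with hypothetical study counts. The transform follows the common convention of half the sum of the two arcsines with variance 1/(4n + 2), a DerSimonian-Laird estimator is assumed for the between-study variance, and the simple sin² back-transform is a simplification of the exact harmonic-mean inversion:

```python
import math

def ft_transform(events, n):
    """Freeman-Tukey double arcsine of a proportion events/n:
    y = (asin(sqrt(e/(n+1))) + asin(sqrt((e+1)/(n+1)))) / 2, var = 1/(4n+2)."""
    y = 0.5 * (math.asin(math.sqrt(events / (n + 1)))
               + math.asin(math.sqrt((events + 1) / (n + 1))))
    return y, 1.0 / (4 * n + 2)

def pool_random_effects(studies):
    """DerSimonian-Laird random-effects pooling of (events, n) pairs.
    Returns the pooled proportion (simple sin^2 back-transform) and I^2."""
    ys, vs = zip(*(ft_transform(e, n) for e, n in studies))
    w = [1 / v for v in vs]                      # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    w_star = [1 / (v + tau2) for v in vs]        # random-effects weights
    pooled_y = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return math.sin(pooled_y) ** 2, i2

# Hypothetical detectability data: (detected, total) per study
studies = [(46, 50), (170, 200), (88, 100), (58, 80)]
pooled, i2 = pool_random_effects(studies)
print(f"pooled proportion = {pooled:.2f}, I^2 = {i2:.0%}")
```

The per-study variance here depends only on n, which is what makes the double arcsine attractive for proportions near 0 or 1.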
R version 4.2.1 was used for all statistical analysis, and the function metaprop from package meta was utilized to perform meta-analysis and generate pooled estimates [24]. An alpha level of 0.05 was considered statistically significant.

Literature Search
The study selection process is shown in the PRISMA flowchart ( Figure 2). The literature search yielded 401 studies: 90 from MEDLINE, 189 from EMBASE, and 122 from Web of Science. After removing 191 duplicate articles, the remaining 210 were screened based on their title and abstract on the Rayyan platform, and 111 were excluded. Full texts of the remaining 99 articles were acquired and reviewed.

Seventy-four articles were excluded because they were: irrelevant studies or studies lacking a significant amount of data (n = 15); overlapping patient cohorts (n = 4); reviews, letters, guidelines, editorials, conference abstracts, and poster presentations (n = 30); focused on glioma and metastasis differentiation (n = 12); focused on segmentation performance (n = 5); based on an imaging modality other than MRI (n = 6); or used machine learning, not deep learning (n = 2).
Finally, 25 eligible studies were identified [14,…]. Due to a lack of sensitivity reporting, one study was included in the review but not in the quantitative analysis.
Table 1 shows a quality assessment summary of the included studies using the CLAIM. The mean CLAIM score of the included studies was 26.56 with a standard deviation (SD) of 3.19 (range, 18.00-31.00). The mean scores of the subsections of the CLAIM were 1.52 (SD = 0.71) for the title/abstract section, 2.00 (SD = 0.00) for the introduction section, 18.16 (SD = 2.25) for the methods section, 2.00 (SD = 0.82) for the results section, 1.96 (SD = 0.20) for the discussion section, and 0.92 (SD = 0.28) for the other information section.
Figure 3 illustrates a quality assessment summary of the included studies using the QUADAS-2 tool. In terms of patient selection, eight studies showed an unclear risk of bias because they did not specify the inclusion criteria for patient enrollment, or they did not report the excluded patients or exclusion criteria [26,30,[33][34][35]41,45,47]. Six studies were found to have a high risk of bias due to patient exclusion based on lesion size or the number of lesions [25,29,31,32,37,38]. The inclusion and exclusion criteria of the studies can be found online in Supplementary Table S3. All included studies were determined to have a low risk of bias in the index test section since the algorithm was blinded to the reference standard.
Since the lesions were annotated on contrast-enhanced MRI manually or semi-manually with manual correction, which is the method of choice for assessing and delineating brain metastases, all studies in the reference standard section were considered to have a low risk of bias [5]. No risk of bias was found concerning the flow and timing. There were no concerns regarding the applicability of patient selection and index tests. Two studies were found to have high concerns about the applicability of the reference test because they used 1.0 Tesla (T) scanners [36,40]. In current practice, scanners with magnets less than 1.5T are not recommended for BM detection [49].

Characteristics of Included Studies
The patient and study characteristics are shown in Table 2. All studies were conducted retrospectively. Five studies were multi-center studies [14,32,36,43,44], and the others were single-center. Four studies used cross-validation [30,31,38,43], and others used split training-test sets to evaluate their models. Delineation of brain metastases was done semi-automatically in one study to serve as a reference standard [25], and it was done manually in others. There were 6840, 1419, and 643 patients in the training and validation sets combined, test sets, and other sets, respectively. Ultimately, 25 studies included a total of 8902 patients. The number of metastatic lesions in the training sets was reported in 20 studies, totaling 31,530. Furthermore, the number of metastatic lesions in the test sets was documented in 18 studies, making a total of 5565. The total reported number of metastatic lesions was 40,654. Twenty-two studies included multiple primary tumor types, and three studies included only one primary tumor type. Two articles included only malignant melanoma patients, and one included non-small cell lung cancer (NSCLC) patients [30,36,40]. Table 2 also displays the means or medians of the volumes of the lesions or the longest diameter of the lesions.

1 The total dataset contains 291 females and 239 males. 2 The combined training and validation sets contain 1460 lesions. 3 Unlabeled data contained 867 patients. 4 Validation and testing sets included 35 female and 30 male patients combined. 5 Synthetic images are not included in our study. 6 The authors stated, "this study included imaging data from referring institutions, it does not approximate to a true multicenter approach" in the limitations. 7 The test group included 17 patients with brain metastases and 17 without enhancing lesions. 8 This study is not included in the quantitative analysis since they did not report the sensitivity. 9 Fifty-nine patients do not have brain metastases.
10 Due to image anonymization, gender information was missing in 49 patients. 11 The test set included 45 patients with brain metastases and 49 patients without. 12 The total dataset contains 30 females and 39 males. 13 The sets included 269, 108, and 529 patients without brain metastases, respectively. 14 The whole dataset included 878 metastatic lesions. 15 The total dataset contains 46 males and 75 females. 16 The total dataset contains 460 males and 474 females. 17 The mean size of the metastases in the entire dataset was 9 mm.

Assessment of Detectability Performance
In our study, we included the internal test set results or the cross-validation results in the pooled detectability analysis. If more than one algorithm was used in a study, we utilized the best algorithm in our pooled analysis. Furthermore, if more than one level of input was present, the one that yielded the most successful outcome was included in our study.

Patient-Wise
Twenty studies evaluated and reported their detectability lesion-wise, whereas the remaining four assessed and reported their detectability voxel-wise or patient-wise. The detectability of the 24 included studies ranged from 58% to 98%. The pooled proportion of detectability (patient-wise) of deep learning algorithms in all 24 included studies was 89% (95% CI, 84-92%) (Figure 4). The meta-regression did not find a significant impact of the training sample size on the sensitivity (p = 0.12). The Q-test indicated that heterogeneity was present across the studies (Q = 111.13, p < 0.01), and the Higgins I² statistic showed high heterogeneity in detectability (I² = 79%).
There was no statistically significant difference in performance between the groups in any subgroup analyses (p-values ranged from 0.08 to 0.69). Heterogeneity was moderate or high in subgroups separated based on the plurality of MRI sequences utilized (single MRI sequence: Q = 50.86, p < 0.01, I² = 72%; multiple MRI sequences: Q = 33.09, p < 0.01, I² = 76%), validation method (split training-test: Q = 83.18, p < 0.01, I² = 76%; cross-validation: Q = 11.63, p < 0.01, I² = 83%), study design (single-center: Q = 52.65, p < 0.01, I² = 66%; multi-center: Q = 36.83, p < 0.01, I² = 89%), and the calculation method of the reported sensitivity (lesion-wise: Q = 68.78, p < 0.01, I² = 72%; others: Q = 37.60, p < 0.01, I² = 92%). Subgroup analyses based on the plurality of the primary tumor type revealed high heterogeneity in studies with multiple primary tumor types (Q = 105.92, p < 0.01; I² = 81%). There was no evidence of heterogeneity in studies with a single primary tumor type (Q = 0.66, p = 0.72; I² = 0%). Furthermore, subgroup analyses based on the dimension of the included images revealed high heterogeneity in studies with 3D images (Q = 94.2, p < 0.01; I² = 84%).
There was no evidence of heterogeneity in studies with a mixture of 2D and 3D images (Q = 9.64, p = 0.09; I² = 48%) or in studies with 2D images (Q = 0.72, p = 0.39; I² = 0%). The meta-regression revealed that the heterogeneity not explained by the training sample size remained significant (p < 0.0001), indicating that the training sample size did not account for the heterogeneity.
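The reported I² values can be reproduced from the Q statistics and degrees of freedom (k − 1 for k pooled studies): for the 24 patient-wise studies, Q = 111.13 gives I² ≈ 79%, and for the 18 lesion-wise studies reported later, Q = 551.71 gives I² ≈ 97%:

```python
def i_squared(q, k):
    """Higgins I^2 from Cochran's Q with k studies (df = k - 1)."""
    df = k - 1
    return max(0.0, (q - df) / q)

print(f"{i_squared(111.13, 24):.0%}")  # 79% (patient-wise analysis)
print(f"{i_squared(551.71, 18):.0%}")  # 97% (lesion-wise analysis)
```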
The funnel plot was asymmetrical, indicating publication bias among the included studies (Figure 5). Furthermore, not all studies were plotted within the pseudo-95% CI region, suggesting possible publication bias [50]. However, the Egger test did not indicate obvious publication bias (regression intercept = 1.32, p = 0.15).
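The Egger test reported above regresses the standardized effect on precision and examines whether the intercept departs from zero; a minimal sketch with hypothetical effects and standard errors (the p-value step, which needs the intercept's standard error and a t-distribution, is omitted):

```python
def egger_intercept(effects, ses):
    """Egger's regression: standardized effect (y/se) on precision (1/se).
    Returns the fitted intercept; values far from zero suggest asymmetry."""
    x = [1 / s for s in ses]                    # precision
    z = [y / s for y, s in zip(effects, ses)]   # standardized effect
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    slope = (sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
             / sum((xi - mx) ** 2 for xi in x))
    return mz - slope * mx

# Hypothetical transformed effects and standard errors
effects = [1.25, 1.18, 1.31, 1.10, 1.35]
ses = [0.05, 0.08, 0.04, 0.10, 0.03]
print(round(egger_intercept(effects, ses), 2))
```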

Lesion-Wise
Among the 20 studies that evaluated and reported their detectability lesion-wise, 18 reported the number of metastases in the test sets. All studies in this group used split training-test sets to evaluate their models. The detectability of the 18 included studies ranged from 58% to 98%. The pooled proportion of detectability of deep learning algorithms (lesion-wise) was 89% (95% CI, 83-93%) (Figure 6). The meta-regression did not find a significant impact of the training sample size on the sensitivity (p = 0.86). The Q-test indicated that heterogeneity was present across the studies (Q = 551.71, p < 0.01), and the Higgins I² statistic showed high heterogeneity in detectability (I² = 97%).


There was no statistically significant difference in performance between the groups in any subgroup analyses (p-values ranged from 0.26 to 0.68). Heterogeneity was high in subgroups separated based on the plurality of MRI sequences (single MRI sequence: Q = 370.72, p < 0.01, I² = 98%; multiple MRI sequences: Q = 112.75, p < 0.01, I² = 94%), study design (single-center: Q = 439.53, p < 0.01, I² = 97%; multi-center: Q = 17.01, p < 0.01, I² = 88%), and the dimension of the included images (3D images: Q = 529.39, p < 0.01, I² = 98%; both 2D and 3D images: Q = 21.47, p < 0.01, I² = 77%). Subgroup analyses based on the plurality of the primary tumor type revealed high heterogeneity in studies with multiple primary tumor types (Q = 551.3, p < 0.01; I² = 97%). There was no evidence of heterogeneity in studies with a single primary tumor type (Q = 0.06, p = 0.81; I² = 0%). In this group, two subgroup analyses to investigate heterogeneity could not be carried out: based on the validation method (all studies used the split training-test method) and the detection level (all studies reported lesion-wise sensitivity).
The meta-regression showed that the heterogeneity not explained by the training sample size remained significant (p < 0.0001), indicating that the training sample size did not account for the heterogeneity.
The asymmetrical funnel plot indicated publication bias among the included studies in this group (Figure 7). In addition, only some studies were plotted within the pseudo-95% CI region, indicating possible publication bias [50]. On the other hand, the Egger test revealed no obvious publication bias (regression intercept = 1.17, p = 0.66).


Discussion
BMs are ten times more common than primary malignant brain tumors, and patients with a history of BM should be followed up with imaging every three months and whenever clinically indicated [51,52]. Therefore, there is a massive demand on radiologists to detect and follow up on these lesions. However, radiologists face several challenges in detecting BMs accurately, including a massive workload, difficulty differentiating BMs from noise and blood vessels, small BMs, significant variations in lesion shape and size among patients, weak signal intensities, and multiple locations and lesions in a patient [11]. As a result of these challenges, research into deep learning methods for detecting BMs has recently increased. Because the recent meta-analysis by Cho et al. included just 12 studies, only 7 of which were deep learning studies, we decided to conduct a systematic review and assess the strength of the present evidence [11]. Their study also found a statistically significant difference in false positives between the deep learning and machine learning groups, necessitating focused study of each group. Our analysis showed that studies with deep learning models using MR images perform well in detecting BMs, with a pooled sensitivity of 89% (95% CI, 84-92%). The detectability ranged from 58% to 98%.
The size of the lesions is an important determinant of the success of detecting BMs with deep learning algorithms [53]. For instance, in a study by Rudie et al., detection sensitivity was only 14.6% for BMs smaller than 3 mm but 84.3% for BMs larger than 3 mm [42]. In another study, detection sensitivity was reduced by 71% for BMs < 3 mm compared to all BMs [48]. Various approaches to this problem have been proposed in the literature. Amemiya et al. reinforced their SSD model with feature fusion for small-lesion detection and increased its sensitivity from 35.4% to 45.8% for lesions smaller than 3 mm [25]. In another study, Bousabarah et al. showed that the sensitivity of their models for small BMs ranged from 0.40 to 0.51 [26]. They then trained another model on a subsample containing only small BMs (smaller than 0.4 mL), which achieved an increased sensitivity of 0.62-0.68 for small BMs; notably, this model's sensitivity was 0.43-0.53 across all sizes of BMs. Similarly, Dikici et al. used a sample of smaller BMs and obtained comparable results [54]. Although training deep learning models on small BMs can improve sensitivity for such lesions, there are concerns about applying this method to a heterogeneous group of BMs. Furthermore, since a meta-analysis demonstrated that black blood images successfully detect small lesions [55], Park et al. added black blood images to a model using gradient echo sequence images and increased sensitivity for BMs < 3 mm by 23.5% [39]. Kottlors et al. also showed that smaller BMs (<5 mm) could be detected better by deep learning models using black blood images [38]. Yin et al. detected BMs < 3 mm with an impressive sensitivity of 89% using only contrast-enhanced T1WI, with a sensitivity of 95.8% across all sizes of BMs [44].
We believe this success was due in part to the large number of metastases in the cohort (11,514 BMs) and the high proportion of small BMs (<5 mm, 58.6%) in their sample. Therefore, more models that are trained on data representative of the real world and succeed on both small and large lesions are required. Moreover, surgery or SRS planning scans with a higher spatial resolution can be more sensitive for detecting metastases than conventional scans, revealing lesions not previously detected [56]. Utilizing MRI scanners with a higher spatial resolution might therefore improve the efficacy of deep learning models, particularly for smaller lesions. We also analyzed the impact of training sample size on detectability, but no statistically significant effect was observed. Studies in the literature demonstrate that the success of deep learning models typically increases with sample size [57,58]. Although our results contradict this, one possible explanation is that even when a paper includes a small number of patients, the number of lesions in those patients can be quite high. Different medical imaging modalities have unique characteristics and respond differently to different tissues in the human body. Our review showed that ten studies used multiple MRI sequences in their models, whereas the remaining fifteen used only one. Our meta-analysis found no statistically significant difference between the pooled detectability of studies using a single MRI sequence and that of studies using multiple MRI sequences. Park et al. used two MRI sequences in their model, both separately and in combination [39]. Gradient echo sequence and black blood images showed 76.8% and 92.6% sensitivity, respectively, and their combination showed 93.1% sensitivity.
There was a statistically significant difference between the models using only the gradient echo sequence and combined sequences, but no difference between the black blood model and the combined model. This was also evident in the models' sensitivity for lesions smaller than 3 mm, with the gradient echo sequence model, the black blood model, and the combined model showing sensitivities of 23.5%, 82.4%, and 82.4%, respectively. Charron et al. showed that a model combining different MRI modalities surpassed models using single modalities [27]. Contrary to these findings, our subgroup analyses showed no statistically significant difference between models using single or multiple MRI sequences. Additional research is therefore needed to determine which MRI sequences and sequence combinations are the optimal strategy for automatically detecting brain metastases with deep learning. Furthermore, there was no statistically significant difference in the performance of models trained on only 2D images, only 3D images, or a combination of the two; however, this comparison is not ideal because of the small number of studies that used only 2D images or both 2D and 3D images.
Most BMs are caused by lung cancer, breast cancer, melanoma, and renal cell carcinoma [2]. Our review showed that two articles included only malignant melanoma patients and one included only NSCLC patients; the others included BM patients with any primary tumor. The appearance of BMs on MRI may vary between primary tumors, particularly in NSCLC [36,59]. Accordingly, Jünger et al. stated that different deep learning models should be designed for BMs originating from different primary tumors to obtain satisfactory detection performance [36]. Although this may help increase performance for BMs originating from a particular primary tumor, the primary tumor is unknown in 5-40% of patients presenting with symptoms of BMs [60]. Furthermore, the primary tumor in up to 15% of BM patients cannot be identified [61]. Therefore, generalized deep learning models might be required to detect BMs with unknown primary tumors. We also performed subgroup analyses to compare the performance of studies with a single primary tumor type and studies with multiple primary tumor types. There was no statistically significant difference in their performance; however, this comparison was not ideal due to the small number of studies with a single primary tumor type. This remains an open research area, and further studies are needed.
Besides sensitivity, false positives were the other important measure reported by the included studies. An important cause of false positives was the similarity between BMs and blood vessels, as both may present as a small focus of hyperintensity on contrast-enhanced T1WI [46]. Several studies showed that most, or at least some, of their false positives lay in or near vascular structures [25,32,37,39,44,46,47]. Several solutions to this problem have been proposed in the literature. Grøvik et al. hypothesized that adding other MRI sequences, such as diffusion-weighted MRI, may help reduce false positives [32]. Black blood imaging, which is increasingly applied in clinical practice, can also help because it suppresses blood vessel signals [25,38,39,44]. It is worth noting that Park et al. demonstrated that false positives were significantly higher when only black blood imaging was used than when black blood and gradient echo sequence images were used together [39]. Furthermore, Deike-Hofmann et al. demonstrated that including a pre-diagnosis scan in their model greatly reduced false positives, whereas including additional sequences actually decreased specificity [30]. They hypothesized that the models, much like human readers, learned that lesions changing over time are more likely to be BMs, whereas stable structures such as blood vessels are less likely to be. Finally, skull stripping was another method that significantly reduced false positives, particularly extra-axial ones [14,25,42]. We were unable to conduct a pooled analysis of false positive rates because the included studies reported them in different units.
Various deep learning algorithms have been applied to MR images to detect BMs. Our review revealed that the most commonly used algorithm for BM detection was U-Net, in its different versions [62]. U-Net is an algorithm for semantic segmentation, also known as pixel-based classification. It comprises a contracting encoder, which extracts low-level features, and an expanding decoder, which produces the label map. The second most commonly used deep learning algorithm was DeepMedic (Biomedical Image Analysis Group, Department of Computing, Imperial College London, London, UK) [63]. DeepMedic is built around a 3D deep CNN and a 3D conditional random field. Unsurprisingly, both algorithms are the most widely used, as they are easily accessible. They are likely to be used even more in the future, and there is room for research into combining these models for BM detection [64]. We also compared the performance of the different deep learning algorithms; however, there was no statistically significant difference in either the patient-wise or the lesion-wise sensitivity groups (p = 0.25 and p = 0.26, respectively).
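As a rough illustration of the encoder-decoder structure described above, the following hypothetical PyTorch sketch builds a one-level U-Net with a skip connection. It is a toy example, not any of the reviewed models; real BM detectors use deeper variants, often operating on 3D volumes:

```python
# Toy one-level U-Net in PyTorch: contracting encoder, expanding decoder,
# skip connection, and a per-pixel classification head. Illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions with ReLU; padding preserves spatial size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = conv_block(1, 16)                       # contracting encoder
        self.pool = nn.MaxPool2d(2)                        # downsample by 2
        self.bottleneck = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # upsample by 2
        self.dec = conv_block(32, 16)                      # decoder after skip concat
        self.head = nn.Conv2d(16, 1, 1)                    # per-pixel lesion logits

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        d = self.up(b)
        d = self.dec(torch.cat([e, d], dim=1))             # skip connection
        return self.head(d)

net = TinyUNet()
out = net(torch.zeros(1, 1, 64, 64))  # one grayscale 64x64 slice
```

The label map `out` has the same spatial size as the input slice, which is what makes the architecture suitable for pixel-wise lesion segmentation.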
The performance of deep learning models should also be compared with that of radiologists in detecting BMs. Kikuchi et al. compared their model with twelve radiologists, seven board-certified and five residents [37]. Their model exhibited higher sensitivity (91.7%) than the radiologists (88.7 ± 3.7%); however, the article did not compare the algorithm's performance against faculty and trainees separately, and pooling the trainees with the faculty may explain the radiologists' overall lower sensitivity. Rudie et al. compared the performance of the two initial manual annotations with that of the deep learning model [42]. Contrary to the findings of the aforementioned study, detection sensitivity was higher, and statistically significantly so, for the radiologists than for the deep learning model for metastases smaller than 3 mm (63.2% versus 14.7%) and for metastases between 3 and 6 mm (90.8% versus 76.7%). Liang et al. found that their deep learning model detected 2% of BMs that had been overlooked during manual annotation, even though the manual annotations were checked by two investigators and reevaluated by a senior radiation oncologist [14]. In addition, Yin et al. showed that, with the assistance of the deep learning model, readers' mean sensitivity increased by 21% [44]. Another meta-analysis comparing deep learning models with healthcare professionals in detecting diseases from medical images found their diagnostic performance to be equivalent [65]. We believe these results are promising: although completely automated models cannot yet be fully implemented in clinical practice, they may serve as a robust assistant for radiologists. We could not conduct a pooled analysis comparing deep learning models and radiologists because few studies reported radiologists' sensitivity. The literature is lacking in this regard, and more papers comparing the two groups are needed, especially for small, difficult-to-detect lesions.
Our study was not without limitations. The main limitation was the lack of a pooled false positive rate analysis, since the included studies did not report false positives in a uniform unit (per patient, per scan, or per lesion). Second, subgroup analyses based on different MRI scanners and slice thicknesses were not possible because none of the included studies reported results separately for different scanners or slice thicknesses. Furthermore, study heterogeneity was high in our analyses, although this is commonly observed in meta-analyses of imaging-based deep learning studies [13,66-69]. In addition, the subgroups with no evidence of heterogeneity included very few studies, and the Q test is known to have inadequate power to detect true heterogeneity when a meta-analysis includes a small number of studies [70]; the apparent absence of heterogeneity in these subgroups should therefore be interpreted with caution. It is worth noting that, although not a limitation of our study itself, our quality assessment showed that reporting was poor in some studies.

Conclusions
Our study revealed that deep learning algorithms detect BMs effectively, with a pooled sensitivity of 89%. Future deep learning studies should adhere more strictly to the CLAIM and QUADAS-2 checklists, and uniform reporting standards that clearly describe the results of deep learning models in BM detection are needed. Since all the included studies were retrospective, additional large-scale prospective studies are also required.