Robust AI-Driven Segmentation of Glioblastoma T1c and FLAIR MRI Series and the Low Variability of the MRIMath© Smart Manual Contouring Platform

Patients diagnosed with glioblastoma multiforme (GBM) continue to face a dire prognosis. Developing accurate and efficient contouring methods is crucial, as they can significantly advance both clinical practice and research. This study evaluates the AI models developed by MRIMath© for GBM T1c and fluid-attenuated inversion recovery (FLAIR) images by comparing their contours to those of three neuro-radiologists using a smart manual contouring platform. The mean overall Sørensen–Dice Similarity Coefficient (DSC) for the post-contrast T1 (T1c) AI was 95%, with a 95% confidence interval (CI) of 93% to 96%, closely aligning with the radiologists’ scores. For true positive T1c images, AI segmentation achieved a mean DSC of 81%, compared to radiologists’ scores ranging from 80% to 86%. Sensitivity and specificity for the T1c AI were 91.6% and 97.5%, respectively. The FLAIR AI exhibited a mean DSC of 90%, with a 95% CI of 87% to 92%, comparable to the radiologists’ scores. It also achieved a mean DSC of 78% for true positive FLAIR slices versus radiologists’ scores of 75% to 83%, and recorded a median sensitivity and specificity of 92.1% and 96.1%, respectively. The T1c and FLAIR AI models produced mean Hausdorff distances below 5 mm, as well as volume measurements, kappa scores, and Bland–Altman differences that align closely with those measured by the radiologists. Moreover, the inter-user variability between radiologists using the smart manual contouring platform was under 5% for T1c and under 10% for FLAIR images. These results underscore the MRIMath© platform’s low inter-user variability and the high accuracy of its T1c and FLAIR AI models.


Introduction
Despite recent advancements, glioblastoma multiforme (GBM), the most aggressive primary brain neoplasm, remains associated with a poor prognosis [1]. The current standard of care for GBM is maximal safe debulking, followed by concurrent chemoradiation and adjuvant chemotherapy. Magnetic resonance imaging (MRI) of the brain is the primary technique for the evaluation of treatment response and disease progression; radiologists rely on the post-contrast T1 (T1c), fluid-attenuated inversion recovery (FLAIR), T1, T2, and diffusion-weighted imaging sequences to help them detect and diagnose tumor growth [2].
MRI is recommended within 72 h after surgery for the assessment of residual disease, followed by subsequent MRIs every 2-6 months. Traditionally, the size of GBM on MRI is assessed by measuring the product of the maximal cross-sectional diameters [3]. However, there is evidence that 3D volumetric measurements are better at detecting the growth of low-grade gliomas (LGGs) than the maximal cross-sectional diameters, whose accuracy relative to volumetric analysis against previous and baseline scans is 21.0% and 56.5%, respectively [4]. Accurate segmentation and volumetrics are also crucial for radiation planning. Finally, radiomics and texture analysis of GBM necessitate accurate segmentation of GBM on MRI. Traditionally, gliomas are segmented manually, which is time-consuming and subject to high inter-observer and intra-observer variation [5,6]. The mean kappa score of the gross tumor volume (GTV) of newly diagnosed GBM in a Korean study was 0.58 [6]. Hence, a precise and efficient segmentation technique is needed to improve the clinical management of GBM and to answer fundamental research questions. We have developed an automated AI-based segmentation technique for segmenting brain neoplasms on different MRI sequences, along with a smart manual contouring platform for corrections when needed. Here, we examine the performance of the AI models for T1c and FLAIR GBM sequences as compared to board-certified neuro-radiologists. We also study the inter-user variability of the MRIMath smart manual contouring software (version v1.0.0).

AI Generation
We have developed a proprietary training model architecture through extensive experimentation and iterative refinement. The model features a U-Net architecture [7] as the backbone for end-to-end fully supervised training, with modifications to optimize performance for MRI data segmentation. It incorporates inception blocks [8] to enhance feature extraction across multiple scales and employs robust initialization and regularization techniques such as dropout [9] and L2 normalization [10] to prevent overfitting. At its core, the architecture relies on an encoder-decoder structure: the encoder progressively compresses the input into a condensed feature representation, and the decoder expands these features back to the image dimensions, aiming to predict tumor presence with high per-pixel precision. The decoder then reconstructs the segmented output, culminating in precise segmentation through deterministic convolution and Softmax activation. Data augmentation techniques [11] such as rotation, flipping, translation, and Gaussian noise injection were employed to enhance the robustness of the model against variations in MRI scans. The model was trained from scratch on a proprietary dataset, using the TensorFlow framework [12] to define and train the architecture. We utilized a standard training loop with the Adam optimizer [13], a learning rate of 0.06, and a batch size of 24. Training continued for up to 500 epochs, with early stopping implemented to prevent overfitting. Data preprocessing involves converting 2D DICOM files into 2D numpy arrays [14], maintaining the patient-specific folder structure to ensure that training and validation splits are carried out per patient rather than per file. Images were resized to 256 × 256 pixels and normalized to pixel values between 0 and 1 for training stability. Hyperparameters were determined iteratively, selecting the best model configuration for deployment and statistical analysis.
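The per-patient split, resizing, and intensity normalization described above can be sketched as follows. This is a minimal illustration in plain numpy; the function names, the nearest-neighbour interpolation, and the min-max scaling are our assumptions, since the paper does not specify these implementation details:

```python
import numpy as np

def resize_nearest(img, out_shape=(256, 256)):
    # Nearest-neighbour resize to 256 x 256; the paper does not state the
    # interpolation method, so this choice is illustrative only.
    rows = (np.arange(out_shape[0]) * img.shape[0] / out_shape[0]).astype(int)
    cols = (np.arange(out_shape[1]) * img.shape[1] / out_shape[1]).astype(int)
    return img[rows][:, cols]

def normalize01(img):
    # Min-max scale pixel values into [0, 1] for training stability.
    img = img.astype(np.float32)
    rng = img.max() - img.min()
    return (img - img.min()) / rng if rng > 0 else np.zeros_like(img)

def split_by_patient(patient_ids, val_frac=0.2, seed=0):
    # Split at the patient level, not the file level, so no patient
    # contributes slices to both the training and validation sets.
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(set(patient_ids)))
    rng.shuffle(ids)
    n_val = max(1, int(round(val_frac * len(ids))))
    return set(ids[n_val:].tolist()), set(ids[:n_val].tolist())
```

Splitting by patient rather than by file avoids leakage of near-identical adjacent slices between training and validation.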

Training Data and Golden Truth
The training dataset comprised 2181 T1c series and 1556 FLAIR series MRI scans, with resolutions ranging from 256 × 256 to 512 × 512, collected between 2001 and 2020 from various universities, community hospitals, and imaging centers across the United States. For T1c, the number of slices per series ranged from 21 to 248, averaging 94.28 slices with a standard deviation of 75.29, while the FLAIR series had between 21 and 200 slices, averaging 31.30 slices with a standard deviation of 27.54. MRI magnetic field strengths for T1c included 3.0 T (n = 1218), 1.5 T (n = 961), and 1.0 T (n = 1), with one unspecified. The FLAIR series used similar field strengths: 961 scans at 3.0 T, 585 at 1.5 T, and minor counts at 1.0 T (n = 2) and unspecified (n = 8). The acquisition-type distribution for FLAIR was predominantly 2D (n = 1433), with a smaller proportion of 3D acquisitions (n = 103).
All golden truth segmentation was conducted by a board-certified neuro-oncologist. The datasets were randomly divided into training (80%) and validation (20%) sets. The detailed imaging parameters for the T1c and FLAIR series are presented in Tables A1-A4. Of the 78 patients who met the inclusion criteria, 17 were excluded because their T1c and FLAIR series had missing slices.

Sample Size Calculation
To evaluate accuracy, we compare the performance of the FLAIR and T1c AIs to the consensus golden truth; this procedure generates overall DSC proportions between 0 and 1. We chose an overall DSC proportion of 88% as the reference value for a comparison using two-sided, one-sample Z-proportion hypothesis testing.
We base our sample size calculation on the hypothesis that the proportion of Sørensen-Dice Similarity Coefficient (DSC) scores exceeding the designated threshold of 88% differs from 50%. We expect our AI to exceed this threshold 70% of the time. The comparison will be made using a two-sided, one-sample Z-test with a type I error alpha of 0.05. Using R version 4.1.1, specifically the pwr package for power analysis [15,16], we estimate that a sample of 42 MRIs is adequate to provide 80% power.
In our work, we are interested in testing our AI on both pre- and post-operative MRIs. We expect at least a third of the participants to have both pre- and post-operative MRI scans that can be used in our study. Therefore, we randomly selected 31 subjects from the pool of participants that met our inclusion/exclusion criteria.
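For illustration, a power calculation of this kind can be reproduced with Cohen's arcsine effect size for a one-sample proportion test, the convention used by R's pwr.p.test. Whether the authors used this exact convention is our assumption, and the resulting n depends on the effect-size convention and software defaults, so this sketch is indicative only:

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

def n_for_one_sample_prop(p0=0.50, p1=0.70, alpha=0.05, power=0.80):
    # Cohen's effect size h via the arcsine transform, as in pwr's ES.h;
    # this is an assumed convention, not confirmed by the paper.
    h = abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p0)))
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_b = NormalDist().inv_cdf(power)          # power quantile
    # Normal-approximation sample size, rounded up to a whole MRI.
    return ceil(((z_a + z_b) / h) ** 2)
```

With p0 = 0.50, p1 = 0.70, alpha = 0.05, and 80% power, this convention lands in the mid-40s; small differences from the paper's 42 would reflect a different convention or rounding.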

Studies/Series Selection Procedure
Most brain tumor patients are treated at university hospitals, though some may be initially diagnosed at community hospitals. Because our intention is to produce a sample of GBM MRIs representative of the US, we gave preference to MRI studies performed at community hospitals and imaging centers. This approach resulted in 26 studies performed at university hospitals, compared to 20 studies at community hospitals and imaging centers, totaling 46 studies.
Given the prevalence of 1.5 T machines over 3.0 T machines, especially at smaller institutions, preference was given to 3 T magnets, resulting in 12 studies acquired with 3 T magnets and 34 with 1.5 T magnets. T1c 3D acquisitions, being less common than 2D, especially at community hospitals, were preferred due to their informativeness, leading to 25 series acquired in 3D and 21 in 2D. Preference was also given to FLAIR series with more than 25 slices, resulting in 28 FLAIR sequences with fewer than 25 slices and 18 sequences with more than 25 slices. The selection procedure detailed above resulted in 46 MRI studies from 31 patients, acquired from a diverse set of 19 centers across the United States, including 13 community hospitals and clinics, 4 imaging centers, and 2 university hospitals. The MRIs spanned various magnetic field strengths: 1.5 Tesla (n = 33), 3.0 Tesla (n = 12), and unspecified (n = 1). They were performed using equipment from major manufacturers: GE Medical Systems (Covington, GA, USA) (n = 18), Philips Medical Systems (Oakwood, GA, USA) (n = 13), Philips Healthcare (Andover, MA, USA) (n = 9), and Siemens (Erlangen, Germany) (n = 6). The MRI acquisition types included both 2D (T1c: n = 21; FLAIR: n = 44) and 3D (T1c: n = 25; FLAIR: n = 2) formats. The imaging parameters are shown in Tables A5-A8. Various contrast agents were employed depending on machine compatibility and specific imaging requirements, including Gadolinium, Prohance, Omniscan, Dotarem, Magnevist, Multihance, and Optimark. The MRI image resolution configurations included 256 × 256, 288 × 288, 384 × 384, 432 × 432, and 512 × 512. The T1c repetition time ranged from 9.3 to 666.7 ms; the echo time ranged from 2.6 to 14.9 ms.

Annotators, Tasks, and Golden Truth
Three board-certified neuro-radiologists annotated the testing datasets. Each neuro-radiologist was tasked with manual contouring of the T1c and FLAIR sequences of the 46 MRIs of patients diagnosed with GBM, using the smart manual contouring platform of MRIMath. Specifically, they uploaded the images onto the MRIMath platform, performed manual segmentations of the T1c and FLAIR sequences, and then downloaded the data to their computers. They shared the data with MRIMath for analysis. A single consensus ground truth was generated from the annotations of the three neuro-radiologists by majority voting, with each pixel labeled as tumor when it received at least 2 out of 3 votes. In the DSC computation defined in Section 2.4, ε is a small constant, arbitrarily set to 10⁻⁶, added for numerical stability: it prevents division by zero when the denominator is null and ensures that the DSC equals 1 when both the predicted and GT masks are empty, correctly reflecting a perfect match.
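A pixel-wise 2-of-3 majority vote of this kind can be sketched in a few lines (illustrative numpy only; the function name is ours):

```python
import numpy as np

def consensus_mask(masks):
    # masks: list of binary arrays (one per annotator), all the same shape.
    # A pixel is labelled tumour when a strict majority of raters agree,
    # i.e. at least 2 of 3 for three annotators.
    stacked = np.stack([m.astype(np.uint8) for m in masks])
    votes = stacked.sum(axis=0)
    return (votes >= (len(masks) // 2 + 1)).astype(np.uint8)
```

The same function generalises to any odd number of raters via the strict-majority threshold.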
True positives pertain to slices where tumors are identified and confirmed by a specified reference, serving as the ground truth (GT) for a given comparison.

Sensitivity and Specificity
These metrics are defined as follows:

Sensitivity = True Positives / (True Positives + False Negatives)
Specificity = True Negatives / (True Negatives + False Positives)

Sensitivity and specificity are measured at both the per-slice and per-pixel levels, the latter evaluating each individual pixel per scan.
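These definitions translate directly into code. Below is a minimal numpy sketch; defining slice-level positivity as "any tumour pixel present" is our reading of the per-slice metric, and the function names are ours:

```python
import numpy as np

def sens_spec(pred, gt):
    # Sensitivity and specificity from two binary arrays of equal shape.
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

def slice_level(pred_slices, gt_slices):
    # Slice-wise variant: a slice counts as positive when it contains
    # any tumour pixel (our assumed definition).
    p = np.array([s.any() for s in pred_slices])
    g = np.array([s.any() for s in gt_slices])
    return sens_spec(p, g)
```

Calling `sens_spec` on full masks gives the pixel-level metrics; `slice_level` reduces each slice to a single positive/negative label first.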

Hausdorff Distance
In our analysis pipeline, the Hausdorff distance quantifies the discrepancy between ground truth (GT) and predicted segmentation outcomes in the MRI T1c and FLAIR test datasets. In particular, the 95th percentile (Hausdorff 95) is used as a metric for evaluating segmentation accuracy, computed per lesion on a slice-by-slice basis within 3D scans [17-19].
Initially, each 3D image volume, consisting of multiple 2D slices, is processed so that lesions are identified and labeled as independent objects within both the GT and prediction masks. For each matched pair of objects, the 95th-percentile Hausdorff distance is calculated. We report the means, standard deviations, and confidence intervals of the Hausdorff distances across the dataset.
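A symmetric 95th-percentile Hausdorff distance over boundary point sets can be sketched as follows. This is illustrative only: published HD95 definitions differ slightly (e.g. some take the maximum of the two directed percentiles rather than pooling them), and the paper does not specify which variant it uses:

```python
import numpy as np

def hausdorff95(points_a, points_b):
    # Symmetric 95th-percentile Hausdorff distance between two point sets,
    # e.g. boundary pixel coordinates of the GT and predicted lesions.
    a = np.asarray(points_a, float)
    b = np.asarray(points_b, float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = d.min(axis=1)  # each GT point to its nearest predicted point
    d_ba = d.min(axis=0)  # each predicted point to its nearest GT point
    # Pool both directions and take the 95th percentile, which discards
    # the worst 5% of boundary mismatches.
    return np.percentile(np.concatenate([d_ab, d_ba]), 95)
```

For large masks, replacing the dense distance matrix with a k-d tree query keeps memory usage manageable.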

Statistical Methods
To test the hypothesis that the proportion of overall AI DSC measurements exceeding the designated threshold of 88% differs from 50%, a two-sided, one-sample Z-test was utilized. Proportions of DSC above the threshold, along with their corresponding 95% confidence intervals and p-values, are presented. An alpha level of 0.05 was employed to assess significance [15,16]. The analyses were conducted using R version 4.1.1.
Box plots were utilized to visually represent the distribution of Dice scores. To compare the volumes measured by the AI and the three neuro-radiologists, the following methods were applied:
• Linear regression (R²) for measuring the degree of correlation.
• Bland-Altman analysis to assess the agreement between two methods of clinical measurement. To evaluate the range within which the vast majority (95%) of the differences are expected to lie, we define the Limits of Agreement (LoAs) as the mean difference ± 1.96 times the standard deviation of the differences, where 1.96 is the critical value of the standard normal distribution at the 95% level. We report the 95% confidence intervals for the mean difference.
• Cohen's Kappa Score (κ), which measures the agreement between two raters who categorize instances into mutually exclusive categories [20]. For AI-based medical imaging, the Kappa statistic evaluates the agreement between the AI segmentation and the expert radiologists' annotations, calculated as κ = (P_o − P_e) / (1 − P_e), where P_o is the observed agreement and P_e is the expected agreement by chance.
To assess the statistical significance of the Kappa scores, we calculated p-values using a z-score method, which accounts for the variability in Kappa estimation. This involves computing the standard deviation of Kappa and the z-score to derive the p-value, thereby providing a robust measure of agreement significance beyond chance.
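The Bland-Altman limits and the Kappa z-test can be sketched as follows (illustrative numpy; the standard-error formula for κ is one common approximation among several, and the function names are ours):

```python
import numpy as np
from statistics import NormalDist

def bland_altman(v1, v2):
    # Mean difference and 95% Limits of Agreement between two volume series.
    diff = np.asarray(v1, float) - np.asarray(v2, float)
    md, sd = diff.mean(), diff.std(ddof=1)
    return md, (md - 1.96 * sd, md + 1.96 * sd)

def cohen_kappa_binary(pred, gt):
    # kappa = (Po - Pe) / (1 - Pe) for two binary raters, with an
    # approximate z-test p-value; the SE formula below is one common
    # approximation, not the only one in the literature.
    pred = np.asarray(pred, bool).ravel()
    gt = np.asarray(gt, bool).ravel()
    n = pred.size
    po = np.mean(pred == gt)                       # observed agreement
    p1, p2 = pred.mean(), gt.mean()
    pe = p1 * p2 + (1 - p1) * (1 - p2)             # chance agreement
    kappa = (po - pe) / (1 - pe)
    se = np.sqrt(po * (1 - po) / n) / (1 - pe)     # approximate SE of kappa
    z = kappa / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    return kappa, p_value
```

In this setting, `pred` and `gt` would be the flattened binary masks of the AI and a radiologist, and `v1`/`v2` their per-study tumour volumes.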

Patients
The cohort comprised 31 patients diagnosed with GBM, including 16 females and 15 males, who underwent 46 MRI studies. Among these studies, 24 were preoperative and 22 were postoperative MRIs. The mean age of the patients was 56.68 years, with a standard deviation of 12.76 years. The median age was 58 years, with an interquartile range (IQR) of 14.25 years. The racial distribution was predominantly white (n = 29), with African American (n = 2), mirroring the incidence rates in the US population (white = 83.2%, black = 5.9%) [21].

Hypothesis Testing
We evaluated the hypothesis that the proportion of overall AI DSC measurements exceeding a designated threshold (p0 = 0.88) differs from 50%. The two-sided, one-sample Z-test revealed the following:

• For the FLAIR AI, the DSC proportions exceed p0 74% of the time, with a confidence interval (CI) of (60%, 84%) and a p-value of 0.001, indicating that the proportion differs significantly from 50%.
• For the T1c AI, the DSC proportions exceed p0 89% of the time, with a CI of (77%, 95%) and a p-value of <0.001, likewise indicating a proportion significantly different from 50%.
These results demonstrate that the measurements from our device, based on both AI models, are statistically significant and clinically relevant. The high proportion of DSCs exceeding the p0 threshold demonstrates the efficacy of the AI in aligning with the consensus ground truth from the three radiologists. This high level of agreement underscores the potential of our AI models to support and enhance diagnostic accuracy in clinical settings, as detailed in Table 1.

Dice Score Comparisons
We present a comparative analysis of the performance of AI-generated segmentation against the consensus GT, as well as the inter-radiologist agreement, for the FLAIR and T1c imaging modalities. We measure the DSC of the entire dataset and of the true positive set, i.e., the slices with tumors. Figure 1 displays the T1c and FLAIR AI-generated segmentations alongside the consensus GT delineations for small and large tumors. The depicted tumors are highlighted with a semi-transparent red overlay and are delineated by a solid red outline. The comparison clearly demonstrates a high degree of agreement between the AI-generated segmentation and the consensus GT, affirming the efficacy of both the FLAIR and T1c AIs in accurately reproducing the expert radiologists' assessments.

Overall Dice Scores for T1c and FLAIR Modalities
Table 2 illustrates the agreement level between AI-generated segmentation and the consensus GT for both the T1c and FLAIR modalities, along with the consistency of inter-radiologist assessments. The AI-consensus pairing for the T1c modality achieves a mean DSC of 94.72%, with a 95% CI of (93.31%, 96.13%), similar to the mean DSC values observed for the radiologists' pairings, which range from 95.44% to 95.78%. For the FLAIR modality, the AI-consensus pairing achieves a mean DSC of 89.47%, with a 95% CI of (86.82%, 92.12%), comparable to those obtained from the radiologist comparisons, whose mean DSCs range from 89.32% to 91.64%. The box plots of the overall T1c and FLAIR DSCs reveal a consistent median convergence across all tested pairings, suggesting synchronized performance between the AI system and the radiologists (Figure A2). These results underscore the following:
1. The solid alignment of the T1c and FLAIR AIs with the consensus GT.
2. The low variability between the radiologists using the MRIMath smart manual contouring platform.

True Positive Dice Scores
The true positive DSCs for both T1c and FLAIR are also similar to the measurements obtained by comparing the radiologists (see Table A13). In particular, the mean DSCs of the T1c and FLAIR AI models are 81.43% and 77.62%, with 95% CI ranges of (75.60%, 87.26%) and (71.42%, 83.81%), respectively. The mean DSCs between radiologists range from 76.33% to 86.09% for T1c and from 75.10% to 83.38% for FLAIR images, with 95% CIs of (70.33%, 89.42%) and (71%, 87.22%), respectively. The box plots for the true positive DSCs are shown in Figure A1.

Dice Score Subgroup Analysis
We conducted a subgroup analysis focusing on different settings within the dataset, including institutional type (university hospitals, community hospitals, and imaging centers), MRI manufacturer (GE, Philips, Siemens), lesion size, single versus multiple tumors, field strength, acquisition type (2D, 3D), and operative status (pre, post). Table 3 reveals that the segmentation models exhibit a high degree of accuracy and consistency, highlighting their robustness across various clinical and technical settings. The lowest DSC of 85.18% is measured for small tumors on FLAIR imaging, reflecting the increased sensitivity of the DSC to smaller tumor volumes, where minor segmentation inaccuracies become more significant.

Sensitivity and Specificity
This section presents an in-depth analysis of slice-wise and pixel-wise specificity and sensitivity for both the T1c and FLAIR modalities, comparing AI-generated results with those obtained from the neuro-radiologists. The slice-level analysis for the T1c and FLAIR modalities shows that the AI models achieve high specificity and sensitivity, closely aligning with or even surpassing radiologist benchmarks (Table A9). Specifically, the specificity metrics across all AI-radiologist pairings indicate near-perfect performance (T1c: 97.49%, FLAIR: 96.10%), suggesting a high degree of accuracy in correctly identifying negative cases. Sensitivity results are also robust (T1c: 91.63%, FLAIR: 92.09%) and comparable to the levels observed between radiologists.

Hausdorff Distance
Table 4 reveals that, as compared to the radiologists, the AI exhibits a consistent range of mean Hausdorff distances, all of which are notably below 5 mm for both the T1c and FLAIR modalities. This uniformity highlights the AIs' capacity to reliably capture the essential contours of the segmented objects with a high degree of fidelity across modalities [1,3,4].

Tumor Volumes: Linear Regression
Figure 2 plots the linear relationship between the volumes measured by the T1c and FLAIR AIs versus the consensus GT, and among the radiologists. The results confirm a high degree of correlation and agreement between the consensus GT and both the T1c AI (R² = 0.965 for the OLS line; R² = 0.939 for the y = x line) and the FLAIR AI (R² = 0.967 for both the OLS and y = x lines). A detailed comparison across different radiologist pairings reveals that the R² of the agreement between the AI and the consensus GT is similar to the radiologists' for both the OLS and y = x lines (Table A14). In panels (a) and (e), the T1c and FLAIR AI models demonstrate an exceptional correlation with the consensus ground truth, as evidenced by R² values close to 1 and regression slopes nearly equivalent to the line y = x, indicating not only high predictive accuracy but also volume conservation in the tumor segmentation. These findings are corroborated by the high degree of alignment in the regression slopes and intercepts (Table A14). Specifically, the slopes of the best fits of the AI to the consensus GT are closer to 1 (0.886 for T1c and 1.007 for FLAIR) than most of the comparisons between radiologists (0.845, 0.759, 0.641 for T1c and 1.001, 0.851, 0.837 for FLAIR; Table A14). Comparative analysis among radiologists, shown in panels (b) to (d) for T1c and (f) to (h) for FLAIR, reveals a generally high level of agreement, with R² values consistently above 90%. Furthermore, the high R² among different pairings of radiologists relative to the y = x line, detailed in Table A14, reflects a high level of consistency, ensured by MRIMath©'s smart manual contouring.
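The distinction between R² about the OLS fit and R² about the fixed line y = x can be made concrete with a short sketch. This is illustrative only; we assume the paper's "y = x line" R² is computed by replacing the fitted values with x itself:

```python
import numpy as np

def r2_ols(x, y):
    # Coefficient of determination of the OLS fit y ~ a + b*x.
    x, y = np.asarray(x, float), np.asarray(y, float)
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - np.sum(resid ** 2) / ss_tot

def r2_identity(x, y):
    # R^2 computed about the fixed line y = x: unlike the OLS version,
    # it penalises systematic offset or scale errors, so it also reflects
    # volume conservation between the two measurements.
    x, y = np.asarray(x, float), np.asarray(y, float)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - np.sum((y - x) ** 2) / ss_tot
```

A method with a constant volume bias can still score R² = 1 against its own OLS line while scoring much lower against y = x, which is why both values are reported.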

Tumor Volumes: Bland-Altman Analysis
Figure 3 presents the Bland-Altman plots assessing the agreement of segmented tumor volumes between the T1c and FLAIR AIs and the consensus GT, as well as among the different radiologists; the Limits of Agreement (LoAs) and the mean differences with 95% confidence intervals are depicted, and the results are summarized in Table A15. The figure reveals a tight correlation between the volumes measured by the T1c and FLAIR AIs and the consensus golden truth. For example, in the T1c analysis, the mean difference in volume measurement between the AI and the consensus GT is 2065 mm³ (Table A15), which is in the range of differences measured between radiologist pairings (1583 mm³ to 3720 mm³). In the FLAIR analysis, the AI vs. consensus GT mean difference is 154 mm³, considerably smaller than the minimal difference between radiologists of 1040 mm³. Table A15 also demonstrates that the Limits of Agreement for both the T1c and FLAIR AIs relative to the consensus GT are within the range of, or better than, those measured from the radiologist pairings. These findings highlight the AIs' capability to maintain a high level of precision and reliability in tumor volume assessments.

Kappa Score (k)
Table A16 presents a comparative analysis of Kappa scores, revealing substantial agreement, with κ scores of 0.7617 and 0.6867 for the T1c and FLAIR AIs compared to the consensus GT, respectively. These agreement levels are within the range of scores obtained by comparing the radiologists.

Variability of the Smart Manual Contouring Platform of MRIMath
The results demonstrate that the manual contouring platform of MRIMath is associated with low variability (<5% for T1c and <10% for FLAIR). Furthermore, there was no statistically significant difference between the DSCs and volumes measured by the three neuro-radiologists.

Discussion and Conclusions
Our results reveal that the T1c and FLAIR AIs' pixel and volume predictions align closely with the manual contouring by board-certified neuro-radiologists. Furthermore, our smart manual contouring system yields low inter-user variability.
We compare the MRIMath© platform's results side by side with other leading platforms, such as Neosoma [22] and the state-of-the-art Swin Transformer model [23], trained on the BraTS dataset [24] by Nvidia and Vanderbilt University [25]. Table 5 details key aspects of model operation, preprocessing requirements, and performance across these platforms, illustrating the unique approach of each. MRIMath© is noted for being fully automated, requiring no preprocessing steps that involve human supervision, such as deboning, interpolation, and registration; this automation significantly simplifies the preprocessing pipeline compared to the more complex requirements of Neosoma and Nvidia's model.
While MRIMath© processes images in 2D and treats each modality independently, Neosoma and the BraTS model employ a 3D approach and integrate four modalities. Moreover, MRIMath© adopts a streamlined approach by handling a single series type (FLAIR or T1c), which contrasts with the subcomponent segmentations used by the other platforms. Notably, the performance of MRIMath© is highlighted by DSC scores of 89.47% for FLAIR and 94.79% for T1c. Neosoma shows an average DSC of 88.3% for preoperative data and 77.6% for postoperative data, while the Nvidia model yields an average DSC of 91.3%. The output of the MRIMath© FLAIR AI is equivalent to the sum of all the subcomponents measured by BraTS and Neosoma; the T1c AI segmentation is equivalent to the sum of the enhancing and necrosis subcomponents. These results underscore the distinct advantages of MRIMath© in enhancing the efficiency and accessibility of tumor segmentation for clinical applications. As compared to the literature, the manual contouring platform of MRIMath is associated with low inter-user variability: 5% for T1c and 10% for FLAIR. The larger inter-observer variability for FLAIR is thought to be due to vague and imprecise boundaries [5,6]. The mean kappa of the gross tumor volume (GTV) of newly diagnosed GBM in a Korean study was 0.58, as compared to 0.77 for MRIMath manual contouring [6]. In a recent report, the mean DSC of the GTV of the FLAIR signal of low-grade gliomas was reported at 77% (substantial disagreement) [26]; in contrast, the mean DSC of the manual contouring of FLAIR images using the MRIMath smart platform was 91% (see Table 2).
The MRIMath GBM AIs can potentially be applied in radiotherapy and neurosurgical planning by improving efficiency, saving time, and lowering inter-user variability. They can also be applied to evaluate and update the current standard of care for longitudinal tumor monitoring, including the RANO criteria [4]. By delivering precise and reliable segmentation within seconds, our AI tools set a foundation for accurate volumetric evaluations of tumor progression, which are pivotal for longitudinal monitoring and for clinical trials. The MRIMath smart manual contouring platform offers a safety net that allows physicians to review and approve AI segmentations efficiently and with low variability. An efficient and robust segmentation is also needed for the clinical analysis of PET scans. From a research

2.4. Evaluation Metrics
2.4.1. Overall Dice Score per Patient
The Dice Score (DSC) is utilized to measure the accuracy of the segmentation compared to a golden truth. It is calculated as follows:

DSC = (2 × TP + ε) / (2 × TP + FP + FN + ε)

In this equation:
• TP (true positives) represents accurately segmented pixels.
• FP (false positives) indicates erroneously segmented pixels.
• FN (false negatives) denotes missed pixels in the segmentation process.
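A direct implementation of this definition, including the ε-stabilised empty-mask case, might look like the following numpy sketch (the function name is ours):

```python
import numpy as np

EPS = 1e-6  # the paper's epsilon for numerical stability

def dice(pred, gt, eps=EPS):
    # DSC = (2*TP + eps) / (2*TP + FP + FN + eps) on binary masks.
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    # When both masks are empty, numerator and denominator are both eps,
    # so the score is 1, correctly reflecting a perfect match.
    return (2 * tp + eps) / (2 * tp + fp + fn + eps)
```

The ε term only matters for slices where both the prediction and the GT are empty; everywhere else its effect is negligible at 10⁻⁶.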

Figure 1. Contours of the AI (a,c,e,g) and consensus GT (b,d,f,h) for the T1c (a-d) and corresponding FLAIR series (e-h). (e,f) are the FLAIR sequences corresponding to the small tumor in (a,b); (g,h) are the FLAIR sequences corresponding to the large tumor in (c,d). Tumor segmentation is marked with a semi-transparent red overlay and delineated by a solid red outline.

Figure 2. Linear regression analysis of volumes across AI and radiologist pairings for both T1c (a-d) and FLAIR (e-h) series. The analyses include AI vs. consensus GT (a,e), R1 vs. R2 (b,f), R1 vs. R3 (c,g), and R2 vs. R3 (d,h). The red dashed line represents the y = x line; the solid line represents the OLS regression line; the gray region indicates the confidence interval; the blue "x" marks denote the data points. The results of the regressions around the OLS line and the y = x line are summarized in Table A14.

Figure A1. Comparison of Dice score true positive distributions for different pairings between AI and radiologists. Box plots showing the Dice score distributions for the six different combinations of AI and radiologists for the (a) T1c and (b) FLAIR modalities.

Table 1. Results of the two-sided, one-sample Z-proportion test comparing the MRIMath T1c and FLAIR AIs to the reference proportion p0 = 0.88.

Table 2. Overall DSC statistics for the T1c and FLAIR modalities. Comparison across AI and radiologists.

Table 3. Sub-group analysis. Average Dice score for the T1c and FLAIR AIs.

Table 5. Comparison of preprocessing requirements, model characteristics, and performance.

Table A3. Training FLAIR series MR imaging parameters.

Table A4. Training FLAIR series MR imaging parameters (continued).

Table A9. T1c AI slice-level specificity and sensitivity measures.

Table A10. FLAIR AI slice-level specificity and sensitivity measures.

Appendix C. Pixel-Wise Specificity and Sensitivity

Table A11. T1c AI pixel-level specificity and sensitivity measures.

Table A12. FLAIR AI pixel-level specificity and sensitivity measures.