Evaluating the Use of rCBV as a Tumor Grade and Treatment Response Classifier Across NCI Quantitative Imaging Network Sites: Part II of the DSC-MRI Digital Reference Object (DRO) Challenge

We have previously characterized the reproducibility of brain tumor relative cerebral blood volume (rCBV) using a dynamic susceptibility contrast magnetic resonance imaging digital reference object across 12 sites using a range of imaging protocols and software platforms. As expected, reproducibility was highest when imaging protocols and software were consistent, but decreased when they were variable. Our goal in this study was to determine the impact of rCBV reproducibility on tumor grade and treatment response classification. We found that varying imaging protocols and software platforms produced a range of optimal thresholds for both tumor grading and treatment response, but the performance of these thresholds was similar. These findings further underscore the importance of standardizing acquisition and analysis protocols across sites and of software benchmarking.


INTRODUCTION
The National Cancer Institute's Quantitative Imaging Network (QIN), Radiological Society of North America's Quantitative Imaging Biomarkers Alliance (QIBA), and the National Brain Tumor Society's (NBTS) Jumpstarting Brain Tumor Drug Development Coalition all have initiatives aiming to standardize Dynamic Susceptibility Contrast (DSC) MRI protocols and postprocessing methods. Standardization of relative cerebral blood volume (rCBV) as a quantitative biomarker for glioma care is warranted because rCBV is increasingly adopted in multisite clinical trials and protocol variability could compromise its use as a reliable biomarker of response (1-3). For example, a recent systematic meta-analysis of 26 published studies found that although DSC-MRI accurately distinguishes tumor recurrence from post-treatment radiation effects within a given study, inconsistency of DSC-MRI protocols between institutions led to substantial variability in reported optimal thresholds. These inconsistencies emphasize the need for harmonization before a specific quantitative DSC-MRI strategy is adopted across institutions for routine clinical use (4). To overcome this challenge, the American Society of Functional Neuroradiology (ASFNR) provided a minimal set of protocol recommendations for the acquisition of clinical DSC-MR images (5). In addition, the aforementioned initiatives, e.g., QIBA, are working to release more comprehensive recommendations on imaging protocol and postprocessing methods (O. Wu, Personal Communication, January 24, 2020).
In a previously published study involving 12 sites within the NCI's Quantitative Imaging Network (QIN), variable imaging protocols (IPs) and postprocessing methods (PMs) were found to reduce rCBV reproducibility (6). In contrast, another QIN study showed that if acquisition and preprocessing steps were held constant, the variability between sites greatly diminished such that a global threshold to distinguish low- from high-grade tumor could be identified (7). This study extends these previous investigations by evaluating the potential impact of variable IPs and PMs on 2 clinical use cases, namely, classification of brain tumor grade and treatment response assessment. Virtual tumors were designed to reflect each one of these clinical cases using a DSC digital reference object (DRO) representative of a wide range of glioma MR signals. Using these virtual patients, the aim of this study was to evaluate the influence of the previously characterized rCBV reproducibility on the use of rCBV as a classifier for tumor grade and treatment response.

MATERIALS AND METHODS
The previously validated population-based DRO used in this study encompasses 10 000 unique DSC-MRI tumor voxels and was simulated for each IP provided by the 12 participating QIN sites (6,8). In addition to these site-specific DROs, an additional DRO was simulated using parameters from the standard imaging protocol (SIP) as defined by the ASFNR (5). All sites used their PM of choice to compute rCBV maps from these simulated DROs. In summary, the majority of the submitted IPs were closely aligned with the ASFNR recommendations (5), whereas a variety of software platforms were used (IB Neuro, nordicICE, PGUI, 3D Slicer, Philips IntelliSpace Portal [ISP], and in-house processing scripts). A detailed description of each site's IPs and PMs is tabulated in Bell et al.'s (6) study, and tables are reprinted with permission (see online supplemental Tables 1 and 2).
As outlined in the previously published manuscript, there were 3 phases to this study to evaluate the effects of various IPs and PMs (6). Phase I ("site IP w/constant PM") required the managing center to process rCBV maps for each site-specific DRO. Computation of rCBV was based on previously optimized methods (9). Some sites provided more than one IP owing to differences in field strengths (n = 15; Table 1). Phase II ("constant IP w/site PM") required each site to process rCBV maps from the standard protocol using their PM of choice. Two sites chose to process rCBV maps using multiple software platforms (see online supplemental Table 2), resulting in 17 submitted rCBV maps (see online supplemental Table 1; second to last column). Phase III ("site IP w/site PM") allowed each site to process their own rCBV maps using their IP and PM, which yielded 25 rCBV maps (see online supplemental Table 1; last column). In total, across all 3 phases, 61 rCBV maps were analyzed in this study.

Virtual Tumor Development
Two clinical data sets were identified, one for each clinical case investigated in this study (more details for each case appear after this paragraph). The mean tumor rCBV values were known a priori for each subject in each data set. The managing center simulated a reference DRO to match the imaging parameters used in the selected clinical data set and then processed these time curves into an rCBV map using previously detailed methods (8,9). In general, virtual tumors were created by selecting 25 voxels of the 10 000 voxels possible from the rCBV map (produced by the reference DRO) such that the mean of these voxels matched each clinical patient. Specifically, this was done by first applying a threshold to identify the DRO voxels whose rCBV value matched each patient-specific mean rCBV to within ±20%, allowing for intratumor heterogeneity. From this pool of DRO indices, 25 voxels were randomly selected. Repeat tumor indices were not allowed for each consecutive simulated tumor. Once these 25 voxels were selected, the mean virtual tumor rCBV values could be found. To evaluate the effects of varying IPs and PMs on tumor grading, these masks were then retrospectively applied to all the 61 submitted rCBV maps for each of the 3 phases outlined above. Specific details for each aim of the study are outlined below, including a flowchart to demonstrate the steps involved (Figure 1).
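The voxel-selection step described above can be illustrated with a minimal Python sketch. It is not the managing center's implementation: the function names, the uniform synthetic rCBV map, the target mean of 2.0, and the 0.8 scaling used to mimic a submitted map are all hypothetical, and "repeat tumor indices were not allowed" is interpreted here as no voxel being reused across virtual tumors.

```python
import random

def make_virtual_tumor(rcbv_map, target_mean, used, n_voxels=25, tol=0.20, rng=None):
    """Pick n_voxels unused voxel indices whose rCBV is within +/- tol of target_mean.

    Assumes target_mean is positive. `used` is a set of indices already assigned
    to earlier virtual tumors; it is updated in place.
    """
    rng = rng or random.Random()
    lo, hi = target_mean * (1 - tol), target_mean * (1 + tol)
    # Candidate pool: voxels inside the tolerance band that no earlier tumor used.
    pool = [i for i, v in enumerate(rcbv_map) if lo <= v <= hi and i not in used]
    if len(pool) < n_voxels:
        raise ValueError("not enough unused voxels within tolerance")
    mask = rng.sample(pool, n_voxels)  # random selection without replacement
    used.update(mask)
    return mask

def mask_mean(rcbv_map, mask):
    """Mean rCBV of the voxels indexed by mask."""
    return sum(rcbv_map[i] for i in mask) / len(mask)

# Hypothetical reference rCBV map with 10 000 voxels (synthetic values).
rng = random.Random(0)
reference_map = [rng.uniform(0.0, 10.0) for _ in range(10_000)]

used = set()
mask_a = make_virtual_tumor(reference_map, 2.0, used, rng=rng)
mask_b = make_virtual_tumor(reference_map, 2.0, used, rng=rng)

# The same mask is then applied to another (hypothetical) submitted rCBV map.
site_map = [0.8 * v for v in reference_map]
print(mask_mean(reference_map, mask_a), mask_mean(site_map, mask_a))
```

Because each mask is only a list of voxel indices, it can be reapplied unchanged to any rCBV map computed from the same DRO voxels, which is what allows one set of virtual tumors to be evaluated across all submitted maps.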
Case 1: Tumor Grade Classification. A publicly available data set on The Cancer Imaging Archive (TCIA) was used to study rCBV-based classification of high-grade gliomas (HGGs) and low-grade gliomas (LGGs) (10,11). This data set contains 49 DSC-MRI images of low- (LGG; n = 13) and high-grade (HGG; n = 36) glial brain lesions with previously published mean rCBV values for each tumor (7). In the end, 24 LGG and 72 HGG virtual tumor masks were simulated.
Case 2: Consistency of Longitudinal rCBV Differences Owing to Treatment. The data set used for Case 2 originated from a previous study by Schmainda et al. (3), in which rCBV changes measured in HGGs undergoing bevacizumab therapy were shown to be predictive of overall survival. This data set contains 36 subjects with 2 imaging time points, namely, pretreatment (preTx) and posttreatment (postTx). The authors provided the mean rCBV value for each subject at each time point. Using this information, virtual tumors were created for each time point for 35 subjects for a total of 70 virtual tumors. In this cohort, 24 subjects were identified as responders (determined by overall survival) and 11 subjects as nonresponders. The mean percent difference in rCBV was then calculated by (rCBV_postTx − rCBV_preTx)/rCBV_preTx × 100%. In the end, 68 virtual masks (preTx and postTx) were simulated for 23 responders and 11 nonresponders.
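As a small worked example of the percent difference defined above, the following Python sketch evaluates the formula for made-up rCBV values; the function name and the numbers are illustrative only and are not taken from the clinical data set.

```python
def percent_change(rcbv_pre, rcbv_post):
    """Percent difference in rCBV: (postTx - preTx) / preTx * 100."""
    if rcbv_pre == 0:
        raise ValueError("pretreatment rCBV must be nonzero")
    return (rcbv_post - rcbv_pre) / rcbv_pre * 100.0

# Hypothetical mean tumor rCBV values at the two time points.
print(percent_change(4.0, 3.0))  # rCBV fell from 4.0 to 3.0
print(percent_change(2.0, 3.0))  # rCBV rose from 2.0 to 3.0
```

A negative value indicates a decrease in rCBV after treatment and a positive value an increase.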

Statistical Analysis
The ability of rCBV to classify tumor grade and therapy response was evaluated using receiver operating characteristic (ROC) analysis. From the ROC analysis, the area under the ROC curve (AUROC) and optimal threshold (defined as the threshold at which the sensitivity and specificity from the ROC analysis intersect) are reported for each submitted DRO. Boxplots were used to show the distribution of the ROC results. The boxplot lines ("whiskers") extend from the 25th and 75th percentiles of the samples, and any observations beyond the whiskers are considered outliers. All statistical tests were done in MATLAB (The MathWorks, Inc., Natick, MA).
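The two ROC summary measures can be sketched in plain Python as follows. This is an illustrative sketch, not the MATLAB analysis used in the study: the function names and the example rCBV values are hypothetical, a case is called positive when its score is at or above the cutoff, the AUROC is computed through its rank-based (Mann-Whitney) equivalent, and the optimal threshold is taken as the candidate cutoff at which sensitivity and specificity are closest.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for every observed score as a cutoff."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    points = []
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)  # true-positive rate at cutoff t
        spec = sum(s < t for s in neg) / len(neg)   # true-negative rate at cutoff t
        points.append((t, sens, spec))
    return points

def auroc(scores, labels):
    """Area under the ROC curve as the fraction of positive/negative pairs
    in which the positive case scores higher (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def optimal_threshold(scores, labels):
    """Cutoff at which sensitivity and specificity are closest to each other."""
    return min(roc_points(scores, labels), key=lambda p: abs(p[1] - p[2]))[0]

# Hypothetical mean rCBV values: label 0 = low grade, label 1 = high grade.
scores = [1.0, 1.5, 2.0, 3.0, 2.5, 3.5, 4.0, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(auroc(scores, labels))              # 0.9375
print(optimal_threshold(scores, labels))  # 3.0
```

In this toy example, a cutoff of 3.0 gives a sensitivity and specificity of 0.75 each. If lower scores indicate the positive class in a given use case, the scores can be negated before calling these functions.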

RESULTS
The distributions of rCBV for both virtual cases were similar to their respective clinical data sets (Figures 2 and 3). For tumor grading (Figure 2B), the mean values were within 10% and 3% for the LGG and HGG populations, respectively. For treatment response (Figure 3B), the mean values were within 0.35% and 1.5% for the preTx and postTx, respectively. Results specific to each case are detailed below.
Boxplots of the ROC results are summarized in Figure 4 (optimal thresholds) and Figure 5 (AUROC) for each clinical use case. In general, the range of thresholds increases for phase II and phase III when compared with phase I of the study, highlighting the effect of varying PMs. The range of optimal thresholds is narrower for phase I, where only IPs differed. Also noted are the wider distributions of optimal thresholds for tumor grading (Figure 4A) compared with those for treatment response (Figure 4B). For tumor grading (Figure 4A), the IP without a preload for a single-echo acquisition (phase I) and the second definition from Philips ISP (phase II) result in optimal thresholds that are deemed outliers. For treatment response (Figure 4B), all IPs with a preload <1 standard dose for a single-echo acquisition (phase I) and PMs that included 3D Slicer and in-house scripts (phase II) resulted in optimal thresholds that markedly differed from the rest of the population. Importantly, the clinical performance of rCBV was highly consistent across sites that used similar IPs and PMs.
Despite the heterogeneity in optimal thresholds for varying IPs and PMs, tighter boxplot distributions in AUROC results are observed across all 3 phases of the study (Figure 5). In general, the distribution of AUROC is the narrowest when a constant PM is used. There are clear outliers for each clinical use case, and all result in a decreased AUROC. For tumor grading (Figure 5A), the 4 outliers observed for phases II and III are those that used PGUI and an in-house script. Note that the optimal threshold outliers do not correspond to the AUROC outliers.

DISCUSSION
In this follow-up to our previous study (6), we further explore how reduced reproducibility affects the potential clinical utility of rCBV with the overarching goal to improve the utilization of quantitative imaging biomarkers extracted from DSC-MRI in neuro-oncology.
The clinical performance of tumor grading and treatment response is generally not diminished with reduced rCBV reproducibility owing to variations in IP and PM, highlighting the robustness of rCBV as a biomarker. All imaging protocols submitted resulted in similar AUROC (~0.8 for tumor grading and ~0.9 for treatment response) when a standardized PM was applied. However, the IP with a preload of <1 standard dose resulted in optimal threshold values that differed from all other IPs. This most likely resulted from an underestimation of rCBV caused by the inability of the leakage correction algorithms to account for the considerable T1 leakage effects that arise in the absence of a preload and optimal pulse sequence parameters. When IP was controlled, the majority of the PMs used in this study yielded rCBV values that were effective classifiers for tumor grading and treatment response, including IB Neuro, nordicICE, 3D Slicer, and Philips ISP. However, 2 of these 4 software packages (disregarding in-house script results) produced optimal thresholds that differed from the rest: 3D Slicer for the treatment response case and Philips ISP for the tumor grading case. The methods that deviated from the mean AUROC included PGUI (AUROC ~0.60 for both clinical cases) and an in-house processing script (AUROC of 0.72 and 0.80 for tumor grading and treatment response, respectively). This result highlights the importance of benchmarking software used for DSC-MRI analysis. The narrower distributions of optimal thresholds and AUROC for the treatment response use case, when compared with tumor grading, are most likely owing to the percent difference calculation partially offsetting protocol-specific rCBV variability. That is, the rCBV values at the 2 time points are most likely equally sensitive to the variations introduced by different imaging protocols and postprocessing methods. Note that this study did not analyze the effect of protocol variations between 2 imaging time points.
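One way to see why a percent difference could offset protocol-specific variability is the following idealized derivation. It rests on an assumption that is not tested in this study: that a given IP and PM combination alters rCBV by a purely multiplicative factor k > 0 that is identical at both time points, so that the measured value is k times the reference value. Under that assumption, the factor cancels:

```latex
\[
\%\Delta\mathrm{rCBV}'
= \frac{k\,\mathrm{rCBV}_{\mathrm{postTx}} - k\,\mathrm{rCBV}_{\mathrm{preTx}}}
       {k\,\mathrm{rCBV}_{\mathrm{preTx}}} \times 100\%
= \frac{\mathrm{rCBV}_{\mathrm{postTx}} - \mathrm{rCBV}_{\mathrm{preTx}}}
       {\mathrm{rCBV}_{\mathrm{preTx}}} \times 100\%
= \%\Delta\mathrm{rCBV}
\]
```

Any additive or nonlinear component of the protocol dependence would not cancel in this way, which is consistent with the offset being only partial.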
Taken together, results of the 2 QIN DSC-MRI DRO studies strongly justify the continuation of current efforts to standardize IPs and PMs, particularly when rCBV is to be used as a quantitative biomarker of treatment response in multisite clinical trials. Even though individual site protocols maintained their clinical performance, the site-to-site threshold variability indicates that applying the same threshold across sites using different PMs is not currently recommended. The lack of consistency of thresholds between PMs even when the same IP and leakage correction algorithms are used (most likely owing to differences in implementation) highlights the need for benchmarking software packages. Because it is unlikely that all vendors can provide exactly the same algorithms and implementation, we propose 2 levels of validation. The first level consists of performing the scientific studies to validate that the software provides clinically meaningful results. The second is to use a benchmark calibration method, such as the DRO used for these studies, so that each vendor can provide the threshold that should be used for a particular test. Only in that way will we have both the freedom to select the software of our liking and the ability to carry out cross-site studies using quantitative measures.
In conclusion, results from this study show that reduced multisite rCBV reproducibility owing to heterogeneous IPs and PMs would confound the reliable use of this biomarker in clinical trials, and further emphasize the need for harmonization of acquisition and analysis methods.
