Deep Learning Segmentation of Triple-Negative Breast Cancer (TNBC) Patient Derived Tumor Xenograft (PDX) and Sensitivity of Radiomic Pipeline to Tumor Probability Boundary

Simple Summary

Co-clinical trials are an emerging area of investigation in which a clinical trial is coupled with a corresponding preclinical trial that informs it. The preclinical arm aids in assessing therapeutic efficacy, patient stratification, and designing optimal imaging strategies. There is much interest in harmonizing preclinical and clinical quantitative imaging pipelines. Radiomics is widely explored in clinical imaging to predict response to therapy. In preclinical imaging, high-throughput radiomic analysis is limited by manual delineation of tumor boundaries, which is labor intensive with poor reproducibility. Our proposed deep-learning-based system was trained to automatically segment tumors from multi-contrast MR images and extract radiomic features. The proposed method is highly reproducible with significant correlation in radiomic features. The deployment of this pipeline in the preclinical arm would provide high throughput and reproducible radiomic analysis.

Abstract

Preclinical magnetic resonance imaging (MRI) is a critical component in a co-clinical research pipeline. Importantly, segmentation of tumors in MRI is a necessary step in tumor phenotyping and assessment of response to therapy. However, manual segmentation is time-intensive and suffers from inter- and intra-observer variability and lack of reproducibility. This study aimed to develop an automated pipeline for accurate localization and delineation of TNBC PDX tumors from preclinical T1w and T2w MR images using a deep learning (DL) algorithm and to assess the sensitivity of radiomic features to tumor boundaries. We tested five network architectures including U-Net, dense U-Net, Res-Net, recurrent residual UNet (R2UNet), and dense R2U-Net (D-R2UNet), which were compared against manual delineation by experts. To mitigate bias among multiple experts, the simultaneous truth and performance level estimation (STAPLE) algorithm was applied to create consensus maps.
Performance metrics (F1-score, recall, precision, and AUC) were used to assess the performance of the networks. Multi-contrast D-R2UNet performed best with F1-score = 0.948; however, all networks scored within 1–3% of each other. Radiomic features extracted from D-R2UNet were highly correlated to STAPLE-derived features, with 67.13% of T1w and 53.15% of T2w features exhibiting correlation ρ ≥ 0.9 (p ≤ 0.05). D-R2UNet-extracted features exhibited better reproducibility relative to STAPLE, with 86.71% of T1w and 69.93% of T2w features found to be highly reproducible (CCC ≥ 0.9, p ≤ 0.05). Finally, 39.16% of T1w and 13.9% of T2w features were identified as insensitive to tumor boundary perturbations (Spearman correlation −0.4 ≤ ρ ≤ 0.4). We developed a highly reproducible DL algorithm to circumvent manual segmentation of T1w and T2w MR images and identified the sensitivity of radiomic features to tumor boundaries.


Introduction
Triple-negative breast cancer (TNBC) is a highly heterogeneous and aggressive cancer characterized by poor outcomes and higher relapse rates compared to other subtypes of breast cancer. Pathological complete response (pCR) is often used as a critical endpoint in the treatment of TNBC following neoadjuvant chemotherapy (NAC), as it is often associated with favorable long-term outcomes. Therefore, it is critical to identify patients who will respond to NAC therapy to avoid the use of ineffective treatments. To support this effort, co-clinical trials are an emerging area of investigation, whereby a clinical trial is coupled with a corresponding preclinical trial that informs it [1][2][3][4][5][6][7]. The emergence of patient-derived tumor xenografts (PDXs) as a co-clinical platform is largely motivated by the realization that established cell lines do not recapitulate the heterogeneity of human tumors and the diversity of tumor phenotypes [8]. A further aim of this study was to address the observer variability arising from manual segmentation. To that end, we extracted radiomic features from the segmented tumor regions to analyze the level of agreement between and within manual and automated methods. In doing so, we examined the sensitivity of the features to tumor boundaries and the reproducibility of the algorithm, signifying its reliability and robustness relative to manual annotation.

Generation of TNBC PDXs
Gene expression analyses of 93 TNBC PDXs (29,657 unique genes/probes) were performed to identify six TNBC subtypes, which included 2 basal-like (BL1 and BL2), an immunomodulatory (IM), a mesenchymal (M), a mesenchymal stem-like (MSL), and a luminal androgen receptor (LAR) subtype [27]. Details regarding animals, surgeries, and tumor xenografts were reported previously [28] and are publicly available at https://c2ir2.wustl.edu/ (accessed on July 26th, 2021). All animal experiments were conducted in compliance with the Guidelines for the Care and Use of Research Animals established by Washington University's Animal Studies Committee.


MR Image Acquisition
MR image acquisition was performed using an MR Solutions small animal simultaneous 7T MR/PET scanner (MR Solutions, Guildford, UK). MR imaging included T1-weighted (T1w) and T2-weighted (T2w) sequences acquired in axial oblique planes perpendicular to the spine of the mouse. PDX mice were anesthetized with 1-2% isoflurane throughout imaging sessions. MR imaging data were obtained for forty-nine mice with TNBC PDX tumors implanted in the inguinal mammary fat pad. The spatial resolution was 0.25 mm × 0.25 mm × 1 mm with a 0.1 mm gap between the slices. The imaging field of view (FOV) was fixed at 32 mm × 24 mm to cover the entire tumor, and four repetitions were acquired and averaged for improved SNR and to reduce motion artifacts. For each PDX, 12-16 T1w and T2w trans-axial slices were obtained with an image dimension of 128 × 128 and were retrieved from the scanner in DICOM format. The multi-parametric MR image acquisition protocol was as follows: T1w: 2D T1-weighted fast spin echo (FSE) multi-slice images were acquired with echo train length 4, echo spacing 11 ms, effective echo time (TE) = 11 ms, respiratory gated with effective repetition time (TR) = 833 ms, and a respiration rate of about 60 breaths/min. T2w: 2D T2-weighted FSE multi-slice images were obtained with echo train length 4, echo spacing 15 ms, effective echo time (TE) = 45 ms, respiratory gated with effective repetition time (TR) = 5000 ms, and a respiration rate of about 60 breaths/min.

Manual Segmentation of the MR Images
An in-house GUI portal was developed using MATLAB R2020a (MathWorks, Natick, MA, USA) for manual delineation of tumor boundaries from the DICOM MR images. Four experts with experience in drawing ROIs on preclinical MR images were selected to annotate the imaging data. The readers were instructed to delineate the boundaries of the tumor from the T2w MR images axially through the sections of the tumor as deemed appropriate by the expert. Each of the readers annotated all the test cases in a single continuous sitting, and an identical display device and lighting conditions were used for each reader. Labels annotated by one expert were used as the ground truth for training of the neural network, while the labels delineated by all four experts were used to test the performance of the network. For test-retest, the tumor boundary delineation was performed by the same set of experts within a one-week gap under identical conditions.

Network Architecture
A fully convolutional neural network (CNN) based on the U-Net architecture was implemented to generate the segmentation maps from the PDX MR images using the Keras and TensorFlow frameworks in Python. The basic U-Net [18] architecture combines a down-sampling (encoder) path to capture the contextual information followed by a symmetrical up-sampling (decoder) path for accurate localization of the features. To increase the amount of contextual information in the up-sampling path, skip connections [29] are implemented to directly concatenate the feature maps from the encoder to the decoder portion of the network. Our implemented model utilized recurrent convolutional layers (RCL) [20] with two time steps, i.e., it performed two subsequent recurrent convolutions and additions following the regular convolution layer. It also used residual connections for direct addition of the previous layer's output to alleviate the vanishing gradient problem [21]. Dense interconnections were applied to facilitate direct concatenation of the previous layer's information into the current layer's output, enhancing feature reuse [22].
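As an illustration of the recurrent and residual building blocks described above, the following minimal NumPy/SciPy sketch (not the authors' Keras implementation; the weights, sizes, and single-channel setting are hypothetical) shows a recurrent convolutional unit with two time steps followed by a residual addition:

```python
import numpy as np
from scipy.signal import convolve2d

def relu(x):
    return np.maximum(x, 0.0)

def rcl_block(x, w_fwd, w_rec, steps=2):
    """Recurrent convolutional layer: a regular convolution followed by
    `steps` recurrent convolutions whose responses are added back to the
    feed-forward response ('same' padding, single channel)."""
    z = convolve2d(x, w_fwd, mode="same")   # feed-forward convolution
    y = relu(z)
    for _ in range(steps):                  # two time steps, as in the text
        y = relu(z + convolve2d(y, w_rec, mode="same"))
    return y

def residual_rcl(x, w_fwd, w_rec):
    """Residual connection: add the block input to the block output."""
    return x + rcl_block(x, w_fwd, w_rec)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))             # toy single-channel feature map
w_fwd = 0.1 * rng.standard_normal((3, 3))   # hypothetical 3 x 3 kernels
w_rec = 0.1 * rng.standard_normal((3, 3))
out = residual_rcl(x, w_fwd, w_rec)
```

In the full network these blocks operate on multi-channel tensors with learned kernels; the sketch only illustrates the recurrence and the skip addition.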
Convolutional layers used kernels of size 3 × 3 with a max pooling operation of 2 × 2 for detection of multiscale features in the encoder portion. Deconvolutional layers of kernel size 3 × 3 were used in the decoder portion of the network. Activation layers after each convolution operation were set as non-linear rectilinear activation units (i.e., ReLU) and a sigmoid function was used for the final activation function, setting the network's output in the range of 0 and 1. In order to mitigate the effect of overfitting of the network due to the small dataset size available for training, spatial dropouts were implemented. The dropout layers [30] were applied prior to the max pooling in the deeper layers of the network in the main architecture and between the RCL blocks to force the network to efficiently learn the finer image features without overfitting to the peculiarities of just the training data. In all, five network architectures were tested: U-Net, dense U-Net, Res-Net, recurrent residual UNet (R2UNet), and dense R2U-Net (D-R2UNet). The architectures of the D-R2UNet and the recurrent convolutional layer unit are depicted in Figure 2a,b respectively.

Preprocessing and Training of the Network
All input images were normalized such that the intensity distribution had zero mean and unit standard deviation for consistent CNN processing. Data augmentation was performed to make the network more robust against enlargement, rotation, and parallel shift. Each image was rotated by 90°, 180°, and 270°, flipped horizontally and vertically, shifted by a factor of 0.05, and sheared with a range factor of 0.05. In order to perform multi-contrast segmentation and to evaluate the training efficacy, we used two images (i.e., T1w and T2w) concatenated together as channels of a single image, i.e., the network takes a single, 3-dimensional tensor (image dimension, image dimension, and channel). A fivefold cross-validation was performed on the training dataset. The dataset was split into five parts and each part was utilized for training and validation. The training set was used to train the network while the validation set was used to monitor the effectiveness of the training and the fine-tuning of the hyperparameters to prevent the network from overfitting to the training data. The mean training and validation loss across the fivefold cross-validation is depicted in Figure 2c and the mean dice accuracy curve is depicted in Figure 2d. The validation standard deviation is also depicted in Figure 2c,d by the purple shaded region.
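The normalization and channel-concatenation step can be sketched as follows (a toy example with random 128 × 128 slices; the function names are illustrative, not taken from the authors' code):

```python
import numpy as np

def zscore(img):
    """Normalize an image to zero mean and unit standard deviation."""
    return (img - img.mean()) / img.std()

def to_multicontrast(t1w, t2w):
    """Stack co-registered T1w and T2w slices as channels of one network
    input of shape (height, width, channels)."""
    return np.stack([zscore(t1w), zscore(t2w)], axis=-1)

rng = np.random.default_rng(1)
t1w = rng.uniform(0, 4095, size=(128, 128))   # toy 128 x 128 slices
t2w = rng.uniform(0, 4095, size=(128, 128))
x = to_multicontrast(t1w, t2w)                # (128, 128, 2) input tensor
```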
The training was performed on a standalone workstation equipped with a Quadro P8000 (NVIDIA) graphics processing unit. The networks were trained using the Adam stochastic gradient descent optimizer [31] with a fixed learning rate of 1 × 10⁻⁵. The initial weights of the filters were initialized using Xavier initialization [32]. The F1-score was used as an accuracy measure for testing the network performance during training, and the dice loss was used as the loss function, which was back-propagated through the CNN for the update of the weights after each epoch. The models were trained for 250 epochs and the segmentation probability maps were obtained. In order to obtain the optimized threshold for maximizing the F1-score of the predicted segmentation, we ran the training data through the trained network to generate precision and recall curves. The intersection of the curves gives the optimum threshold value, which maximizes segmentation performance by giving the highest true positives and lowest false positives. The D-R2UNet takes approximately 3.5 h to train for 250 epochs and 100 s to make predictions on the testing dataset.
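The threshold-selection step described above, i.e., sweeping thresholds over the predicted probability maps and taking the point where the precision and recall curves intersect, can be sketched as follows (toy data; the helper names are hypothetical):

```python
import numpy as np

def precision_recall_curves(prob, truth, thresholds):
    """Precision and recall of a probability map against a binary ground
    truth, evaluated at each candidate threshold."""
    precision, recall = [], []
    for t in thresholds:
        pred = prob >= t
        tp = np.logical_and(pred, truth).sum()
        fp = np.logical_and(pred, ~truth).sum()
        fn = np.logical_and(~pred, truth).sum()
        precision.append(tp / max(tp + fp, 1))
        recall.append(tp / max(tp + fn, 1))
    return np.array(precision), np.array(recall)

def optimal_threshold(prob, truth, thresholds):
    """Threshold where the precision and recall curves intersect,
    i.e., where their absolute gap is smallest."""
    p, r = precision_recall_curves(prob, truth, thresholds)
    return thresholds[int(np.argmin(np.abs(p - r)))]

truth = np.zeros((16, 16), dtype=bool)
truth[4:10, 4:10] = True              # toy ground-truth tumor
prob = np.where(truth, 0.9, 0.1)      # toy network probability map
opt = optimal_threshold(prob, truth, np.linspace(0.05, 0.95, 19))
```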

STAPLE Algorithm to Generate Consensus among Experts
The STAPLE (simultaneous truth and performance level estimation) algorithm [33] was applied to compute the probability estimate of the true segmentation by simultaneously measuring the performance level of each segmentation from a collection of raters using the expectation-maximization algorithm. The STAPLE algorithm helps in creating a consensus map, taking into account the variability among the individual experts. The segmentation mask obtained from applying the STAPLE algorithm to the expert-generated masks was used for the assessment of the DL algorithm. Out of the 49 PDX scanned in the study, image data from 41 mice were used to train, validate, and optimize the hyperparameters of the network and image data from 8 mice were used for testing the network performance and for further radiomic analysis of the tumor region. The overview of the data is summarized in Table 1. The STAPLE estimation and the actual manual delineations are depicted in Figure 3.
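A minimal sketch of the STAPLE expectation-maximization loop for binary masks is given below. It omits the spatial prior of the full algorithm and uses a global foreground prior, so it is a simplified illustration rather than the reference implementation:

```python
import numpy as np

def staple(decisions, n_iter=20):
    """Simplified STAPLE for binary masks: estimate the consensus
    probability map w and each rater's sensitivity p / specificity q by
    expectation maximization. `decisions` is a (raters, voxels) 0/1 array.
    The spatial prior of the full algorithm is replaced by the global
    foreground rate."""
    d = decisions.astype(float)
    n_raters = d.shape[0]
    p = np.full(n_raters, 0.9)        # sensitivity per rater
    q = np.full(n_raters, 0.9)        # specificity per rater
    g = d.mean()                      # global prior on foreground
    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is truly tumor
        a = g * np.prod(p[:, None] ** d * (1 - p[:, None]) ** (1 - d), axis=0)
        b = (1 - g) * np.prod((1 - q[:, None]) ** d * q[:, None] ** (1 - d), axis=0)
        w = a / (a + b)
        # M-step: re-estimate rater performance from the soft consensus
        p = (d * w).sum(axis=1) / w.sum()
        q = ((1 - d) * (1 - w)).sum(axis=1) / (1 - w).sum()
    return w, p, q

mask = np.zeros(100)
mask[:25] = 1.0                        # toy "true" tumor: 25 of 100 voxels
raters = np.tile(mask, (3, 1))         # three raters in perfect agreement
w, p, q = staple(raters)
```

When the raters agree, the consensus map reproduces their common mask and the estimated rater performances approach 1; with disagreeing raters, w becomes a weighted compromise.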

Performance Assessment of the Network
We evaluated the performance of the model in predicting tumor boundaries by using an independent testing dataset. The segmentation performance was calculated before post-processing the tumor segmentation maps by removing all but the largest continuous segmentation regions in each 2D slice for radiomic analysis. The segmentation performance of the U-Net [18], Res-Net [21], dense U-Net [23], and D-R2UNet algorithms was assessed relative to STAPLE maps. The following performance metrics, expressed in terms of true positive (TP), true negative (TN), false positive (FP), and false negative (FN), were compared:
• F1-score: measures the spatial overlap between the predicted image and the ground truth and is given by Equation (1).
• Precision: the fraction of true positives (TP) relative to the tumor region segmented by the algorithm, given by Equation (2).
• Recall: the fraction of true positives (TP) relative to the ground truth segmentation by the experts, given by Equation (3).
• Accuracy: the fraction of correctly classified voxels relative to the total number of voxels, given by Equation (4).
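For reference, the four metrics referred to as Equations (1)-(4) take their standard forms in terms of TP, TN, FP, and FN:

```latex
\begin{align}
\mathrm{F1\text{-}score} &= \frac{2\,TP}{2\,TP + FP + FN} \tag{1}\\
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{2}\\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{3}\\
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{4}
\end{align}
```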

Extraction of Radiomic Features and Correlation between STAPLE and D-R2UNet
The segmented maps obtained from the CNN, along with the manually delineated maps of the experts obtained from STAPLE, were arranged to create a 3D volume to perform the radiomic analysis. The radiomic features were extracted using an in-house developed program written in MATLAB (R2020a) based on the publicly available "Radiomic-Develop" repository [34] following the image biomarker standardization initiative (IBSI) guidelines [35]. In total, 144 radiomic features were extracted for both T1w and T2w MR images. Features were divided into morphological, statistical, and histogram features, as well as GLCM, GLRLM, GLSZM, GLDZM, NGLDM, and NGTDM features, and were generated from 3D segmented tumor regions with 26-voxel connectivity (see Supplementary Material 1). Shape-based features were extracted from the 3D segmented volume, while the intensity-based features (i.e., the first order statistics) were extracted directly from the intensity matrix generated from the tumor segmentation maps. Sixty-four-level fixed-bin grey quantization was used for the histogram analysis of first order statistics. For the texture-based higher order features, a 64-level quantization was performed using the Lloyd-Max quantization algorithm [36], which iteratively calculates the optimum quantizer levels and intervals from the intensity probability density function. Four features were excluded because they are directly correlated with other features: Compactness 1, Compactness 2, and Spherical Disproportion are correlated with Sphericity, while Sum Average is correlated with Joint Average [37].
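The 64-level fixed-bin grey quantization used for the histogram features can be sketched as follows (toy intensities; the Lloyd-Max quantizer used for the texture features would instead place bin edges adaptively from the intensity distribution):

```python
import numpy as np

def fixed_bin_quantize(intensities, n_levels=64):
    """Fixed-bin grey-level quantization: map tumor-voxel intensities to
    integer levels 1..n_levels over the observed intensity range."""
    lo, hi = intensities.min(), intensities.max()
    levels = np.floor((intensities - lo) / (hi - lo) * n_levels) + 1
    return np.clip(levels, 1, n_levels).astype(int)

rng = np.random.default_rng(2)
voxels = rng.uniform(50, 900, size=5000)   # toy tumor-voxel intensities
q = fixed_bin_quantize(voxels)             # levels span 1..64
```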
The Spearman correlation coefficient (SCC, ρ) was used to determine the degree of correlation between radiomic features extracted from the STAPLE algorithm and the automated D-R2UNet algorithm. All correlation values with p ≤ 0.05 were considered significant. This process was repeated for both T1w and T2w and a threshold of ρ > 0.9 was determined to assess which radiomic features showed high correlation between the STAPLE and D-R2UNet maps.
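This screening step, keeping features whose values from the two segmentation sources correlate with ρ ≥ 0.9 at p ≤ 0.05, can be sketched with SciPy (toy feature matrices; the names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def highly_correlated(feat_a, feat_b, names, rho_min=0.9, alpha=0.05):
    """Names of radiomic features whose values from two segmentation
    sources (e.g., STAPLE vs. D-R2UNet, one row per test subject)
    correlate with Spearman rho >= rho_min at p <= alpha."""
    keep = []
    for j, name in enumerate(names):
        rho, pval = spearmanr(feat_a[:, j], feat_b[:, j])
        if rho >= rho_min and pval <= alpha:
            keep.append(name)
    return keep

# Toy example: one monotonically preserved feature, one inverted one.
a = np.column_stack([np.arange(10.0), np.arange(10.0)])
b = np.column_stack([np.arange(10.0) ** 2, -np.arange(10.0)])
kept = highly_correlated(a, b, ["mono", "anti"])
```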

Reproducibility of Radiomic Features by Experts
The reproducibility of segmentation and radiomic features was characterized by test-retest of manual segmentation. Test-retest of tumor boundary delineation was performed by the experts within a one-week gap, in an identical setting, on a randomly shuffled version of the same test dataset. After the retest delineation, the STAPLE algorithm was applied to the segmentation maps to generate a single probability estimate map for the multiple experts, which was used for the reproducibility analysis of the radiomic features. To test the reproducibility of the network, the model was retrained using a randomly reshuffled training and validation dataset with identical hyperparameters and then tested on the same dataset as that of the experts. Bland-Altman (BA) analysis was performed on the tumor volumes between the test-retest measurements obtained from the STAPLE algorithm (i.e., the experts) and the D-R2UNet algorithm to infer the degree of agreement between test and retest. The reproducibility of the radiomic features for both the STAPLE and D-R2UNet segmentations was investigated using the concordance correlation coefficient (CCC) [38,39].
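Lin's concordance correlation coefficient used here can be computed directly from its definition; a minimal sketch (population variances, single feature, toy test-retest values) is:

```python
import numpy as np

def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between test and retest
    measurements of a single radiomic feature."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                 # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

test = np.array([10.0, 12.5, 9.8, 14.1, 11.6])    # toy feature, first read
retest = np.array([10.2, 12.3, 9.9, 14.4, 11.5])  # toy feature, second read
ccc = concordance_ccc(test, retest)
```

Unlike Pearson correlation, the CCC penalizes both a location shift and a scale difference between test and retest, so identical measurements give exactly 1 while a constant offset lowers the score.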

Sensitivity of Features to Tumor Probability Boundaries
To evaluate the sensitivity of the features to changes in tumor boundaries, we computed the SCC of the change in feature values (∆Feature) to that of the difference in volume (∆V) between STAPLE and D-R2UNet maps. A hierarchical clustering of the features was performed using complete linkage based on the correlation of the differences of the feature values with respect to volume change. We also extended the correlation analysis and clustering to assess the sensitivity of the radiomic features to each other, i.e., their cross-correlation subject to boundary change. Complete linkage was chosen for the clustering because it yielded the highest cophenetic correlation coefficient. All of the above-mentioned statistical analyses were performed using Python 3.0 and MATLAB 2020b.
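The clustering step can be sketched with SciPy as below; note that the cut height and the toy feature-difference matrix are illustrative, not the values used in the study (the paper cuts its dendrogram at length 4 on its own distance scale):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

def cluster_feature_changes(delta, cut=0.5):
    """Cluster features by the correlation of their changes across subjects.
    `delta` is a (subjects, features) matrix of feature differences;
    complete linkage is used, and the dendrogram is cut at height `cut`."""
    # Correlation distance (1 - Pearson r) between feature columns,
    # clipped to guard against tiny negative floating-point values.
    dist = np.clip(pdist(delta.T, metric="correlation"), 0, None)
    z = linkage(dist, method="complete")
    coph, _ = cophenet(z, dist)              # cophenetic correlation of the tree
    labels = fcluster(z, t=cut, criterion="distance")
    return labels, coph

# Toy matrix: features 0/1 change together, features 2/3 change together.
base = np.arange(30.0)
alt = (-1.0) ** np.arange(30)
delta = np.column_stack([base, 2 * base + 5, alt, 3 * alt - 1])
labels, coph = cluster_feature_changes(delta)
```

The cophenetic correlation returned alongside the labels is the quantity used in the text to justify choosing complete linkage over other linkage methods.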

Performance of CNN Segmentation
The measures of agreement between the automated segmentation maps obtained from four different networks, i.e., U-Net, dense U-Net, residual U-Net (Res-UNet), and the proposed D-R2UNet, relative to that of the STAPLE maps generated from the expert delineation are summarized in Table 2. The performances of the networks are given in terms of DSC, precision, recall, and AUC. The segmentation performance varied for single-channel (T2w only) and multichannel input (T1w and T2w). Multichannel input exhibited marginally better performance when compared to single-channel input. Among the different networks tested for multichannel input, the D-R2UNet with dice loss as the loss function exhibited the best performance, with a mean F1-score of 0.948 (95% CI, 0.939-0.956). The D-R2UNet also exhibited the highest precision value, signifying its efficiency in decreasing the number of false positives in detection. From the fivefold cross-validation of the different networks, D-R2UNet exhibited a greater mean F1-score, as depicted in Table 3, and thus was selected as the optimal model for further analysis. The F1-score was chosen as the primary metric for model selection and evaluating segmentation performance because it is the harmonic mean of precision and recall. Representative examples of T1w and T2w images are given in Figure 4a,b, respectively, with a STAPLE-generated map in Figure 4c. The performance of the D-R2UNet-generated segmentation map with respect to the STAPLE-generated map is depicted in Figure 4d, while the segmentation error of the D-R2UNet relative to the STAPLE algorithm is depicted in Figure 4e.
The segmented tumor volumes were used to construct the 3D tumor volume after post-processing the largest continuous area from each slice. The tumor volume extracted from the experts and the automated algorithm exhibited a high degree of correlation with CCC equal to 0.991, as depicted in Figure 4f. The BA analysis was also performed between the STAPLE and D-R2UNet algorithm tumor volumes, where the bias was calculated as the difference of the expert volume to that of the algorithm-delineated volume. A mean bias of 4.6% was obtained from the BA analysis, signifying that the algorithm underestimated the tumor volume by a mean of around 4% relative to STAPLE, as shown in Figure 4g. The tumor volumes extracted from the D-R2UNet algorithm segmentation are correlated with the tumor volumes STAPLE-contoured by the experts (ρ = 0.99, p < 0.001). From the BA analysis, we also inferred that the mean tumor volume between the STAPLE and D-R2UNet was negatively correlated to the bias, with a correlation value ρ = −0.739. This means that the bias between the STAPLE and D-R2UNet decreased with increasing tumor volume.
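The slice-wise post-processing, keeping only the largest continuous segmented region in each 2D slice, can be sketched with SciPy (2D with default 4-connectivity here; the radiomic analysis itself used 26-voxel connectivity in 3D):

```python
import numpy as np
from scipy import ndimage

def keep_largest_region(mask):
    """Keep only the largest connected region of a 2D binary segmentation
    slice, discarding smaller spurious detections."""
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask
    # Size of each labeled region (labels run from 1 to n).
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    return labeled == (int(np.argmax(sizes)) + 1)

slice_mask = np.zeros((12, 12), dtype=bool)
slice_mask[1:6, 1:6] = True     # main tumor region (25 pixels)
slice_mask[10, 10] = True       # spurious 1-pixel detection
cleaned = keep_largest_region(slice_mask)
```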

Robustness of Radiomic Parameters Extracted from the D-R2UNet Algorithm
Despite the high correlation between the STAPLE- and D-R2UNet-derived tumor volumes, the extracted radiomic parameters varied significantly due to the difference in tumor boundaries. The SCC (ρ) between the STAPLE- and D-R2UNet-generated segmentations are reported separately for each category of radiomic features and are provided in Supplementary Material 2. All 12 morphological features (shape-based) showed a high degree of correlation (0.83 ≤ ρ ≤ 1, p ≤ 0.05). The global intensity features showed a consistently high degree of correlation (0.95 ≤ ρ ≤ 1, p ≤ 0.05) for T1w and high to moderate correlation for T2w (0.66 ≤ ρ ≤ 1, p ≤ 0.05). For the histogram-based intensity features, the degree of correlation varied significantly for both T2w (0.45 ≤ ρ ≤ 0.97, p ≤ 0.05) and T1w (0.69 ≤ ρ ≤ 0.97, p ≤ 0.05) because of the feature values' dependence on the binning process, which is sensitive to segmentation boundaries. The correlation for the textural features also varied widely across the parameters due to the binning, with GLCM (0.61 ≤ ρ ≤ 1, p ≤ 0.05), GLRLM (0.83 ≤ ρ ≤ 1, p ≤ 0.05), GLSZM (0.73 ≤ ρ ≤ 1, p ≤ 0.05), GLDZM (0.73 ≤ ρ ≤ 1, p ≤ 0.05), NGTDM (0.8 ≤ ρ ≤ 1, p ≤ 0.05), and NGLDM (0.73 ≤ ρ ≤ 1, p ≤ 0.05). Textural features that exhibited ρ < 0.7 were found to be statistically insignificant and hence were not considered. Figure 5a depicts the heatmap representing the degree of correlation between the D-R2UNet- and STAPLE-generated maps for each subcategory of features. A correlation ρ > 0.8 is generally considered an indication of strong correlation [40], but we opted for a stricter threshold of ρ > 0.9 due to the relatively small size of our study dataset. Figure 5b,c depicts the distribution of the percentage of features from each class having high correlations. The percentage of radiomic features exhibiting a correlation ρ ≥ 0.9 by class of features is depicted in Figure 5d.

Reproducibility Analysis of the Radiomic Parameters
We evaluated the reproducibility of the radiomic features for both STAPLE-generated maps and D-R2UNet-generated maps. BA analysis was performed on the test-retest tumor volumes for both cases to quantify the agreement between the two measurements. The mean bias of measurement for tumor volume for test-retest among the experts ranged from −2.2% to 8.37% relative to the first delineation. This wide range of variation was reduced by using STAPLE for test-retest, which gave a mean bias of −2.8% relative to the first delineation, as depicted in Figure 6a. D-R2UNet outperformed every expert and the STAPLE by achieving a mean bias of measurement equal to 1.02% relative to the first training run of the D-R2UNet, as depicted in Figure 6b. The CCC was used to assess test-retest performance of radiomic features for both T1w and T2w MR images. A feature was considered to be highly reproducible if it had CCC > 0.9 [41,42]. For both D-R2UNet- and STAPLE-generated radiomic features, greater than 80% of morphological and greater than 90% of statistical features were highly reproducible. For higher order textural features, the number of reproducible features decreased due to quantization. The percentage of reproducible features varied around 70-80% for T1w and 60-70% for T2w with respect to each higher order radiomic subcategory (Supplementary Material 3) and is given in Figure 6c,d, respectively.

Sensitivity of Radiomic Features to Tumor Boundaries
To evaluate the sensitivity of the radiomic features to changes in the tumor boundaries, the SCC between changes in the radiomic features and changes in tumor volume was calculated. The frequency distributions and their underlying probability density functions are depicted in Figure 7a,b for T1w and T2w, respectively. Features that had SCC values in the range of −0.4 to 0.4 were considered robust to perturbations. Ninety-five T1w radiomic features and fifty T2w radiomic features were found to be robust to perturbations in the tumor boundary (Supplementary Material 4).
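The robustness criterion above can be sketched with `scipy.stats.spearmanr`. The per-tumor feature and volume changes here are hypothetical, chosen only to illustrate the |SCC| < 0.4 cutoff:

```python
import numpy as np
from scipy.stats import spearmanr

def robust_features(delta_features, delta_volume, low=-0.4, high=0.4):
    """Flag features whose change is only weakly rank-correlated
    with the change in tumor volume (|SCC| < 0.4 -> robust)."""
    robust = {}
    for name, deltas in delta_features.items():
        rho, _ = spearmanr(deltas, delta_volume)
        robust[name] = low < rho < high
    return robust

# Hypothetical per-tumor changes between two delineations (illustration only)
dv = [0.5, -1.2, 2.0, 0.3, -0.8, 1.5]
feats = {
    "volume_like":  [0.6, -1.1, 2.2, 0.2, -0.9, 1.4],    # tracks volume change
    "texture_like": [-0.3, -0.1, 0.0, 0.4, 0.2, 0.1],    # weakly related
}
flags = robust_features(feats, dv)
```

Here `volume_like` is flagged as sensitive (its changes track the volume changes rank-for-rank), while `texture_like` is flagged as robust.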
The hierarchical clustering of the cross-correlation of change in feature values resulted in 23 and 26 independent clusters for T1w and T2w, respectively, using complete linkage and dendrogram length = 4, as depicted by the clustergram in Figure 8a.
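The clustering step can be sketched with `scipy.cluster.hierarchy`. The correlation matrix below is a small hypothetical example (the cutoff height of 0.5 is likewise illustrative, not the study's dendrogram length of 4):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical cross-correlation matrix of feature changes (illustration only)
corr = np.array([
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.12, 0.08],
    [0.10, 0.12, 1.00, 0.90],
    [0.05, 0.08, 0.90, 1.00],
])
# Convert correlation to a distance, cluster with complete linkage,
# then cut the dendrogram at a chosen height to obtain flat clusters.
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")
n_clusters = len(set(labels))
```

Complete linkage merges clusters by their farthest-pair distance, so two feature groups join only if every cross-pair is well correlated; one representative per cluster then yields an independent feature set.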

Discussion
In this paper, we implemented CNN-based methods for automatic segmentation of tumors in multi-contrast preclinical MR imaging. All CNN methods proposed within this study performed better than previously published preclinical tumor segmentation methods, including a fast k-means-based level-set method [43], which achieved an F1-score = 0.82 in segmenting TNBC PDX MR images, and a multi-contrast U-Net, which achieved an F1-score = 0.84 in segmenting sarcoma tumors in MR [44]. Of the five networks tested in the current work, D-R2UNet exhibited marginally better performance in terms of F1-score compared to the other DL methods implemented in this work. The dense residual interconnections and the recurrent convolutional layers (RCL) facilitated faster learning of features from a limited data space with fewer parameters through gradient propagation and feature reuse.
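The F1-score used to compare these methods is, for binary masks, equivalent to the Dice overlap. A minimal sketch with toy masks (not the study's data):

```python
import numpy as np

def f1_score(pred, truth, eps=1e-8):
    """F1/Dice overlap between a predicted and a reference binary mask:
    2*TP / (|pred| + |truth|)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    return 2.0 * tp / (pred.sum() + truth.sum() + eps)

# Toy 2D masks for illustration: a 4x4 square and a copy shifted by one row
truth = np.zeros((8, 8), dtype=bool); truth[2:6, 2:6] = True
pred  = np.zeros((8, 8), dtype=bool); pred[3:7, 2:6] = True
score = f1_score(pred, truth)
```

A one-row shift of a 4x4 region leaves 12 of 16 voxels overlapping, giving F1 = 24/32 = 0.75, which illustrates how sensitive the score is to small boundary offsets.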
The use of multi-contrast MR imaging instead of a single contrast has been shown to significantly improve performance in brain segmentation using DL [45]. Our segmentation model was in agreement with this finding and achieved the best performance for multi-contrast data versus T2w-only data (Table 1). The multi-contrast data combined features from T1w and T2w and facilitated better learning of features. Our approach mimics clinical scanning protocols, where multi-contrast MR images are used by radiologists to assess tumor boundaries. One principal issue with manual tumor delineation is the variability of delineation among multiple experts, which leads to a lack of reproducibility [46,47]. In this study, as expected, the tumor volume differed substantially between the experts, demonstrating the need to develop a reproducible pipeline. Even after application of the STAPLE algorithm, which creates a probabilistic map by forming a consensus of all expert delineations, there was variability in repeated delineations of the ground truth by the experts. The algorithm was found to be more robust on train-retrain measures, with only a 1.02% mean volume bias between two runs. Though the training of DL networks takes a considerable amount of time and resources, it is still less labor intensive and more reproducible than manual intervention.
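The idea of fusing multiple expert delineations into one consensus map can be sketched with a simple majority vote. Note this is only a simplified stand-in for STAPLE, which additionally weights each rater by an EM estimate of their sensitivity and specificity; the masks below are hypothetical:

```python
import numpy as np

def majority_vote(masks):
    """Simple consensus of multiple expert masks: a voxel is labeled
    tumor if more than half of the raters labeled it tumor.
    (A simplified stand-in for STAPLE, which also estimates and
    weights per-rater performance.)"""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stack.mean(axis=0) > 0.5

# Three hypothetical expert delineations of a 1D intensity profile
m1 = [0, 1, 1, 1, 0]
m2 = [0, 0, 1, 1, 1]
m3 = [0, 1, 1, 0, 0]
consensus = majority_vote([m1, m2, m3])
```

Voxels on which the experts disagree (here the second, fourth, and fifth positions) are resolved by the vote, which is exactly where consensus methods reduce inter-observer variability.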
The fully automated method also accelerated high-throughput extraction of quantitative radiomic features, enabling extraction of mineable data from segmented tumors. Intensity- and shape-based (first-order) features were highly correlated between the STAPLE and D-R2UNet algorithms. However, the correlation of texture-based features between STAPLE and D-R2UNet varied widely. The textural features were more sensitive to changes in the segmentation boundaries because they are extracted by binning the intensity histogram into discrete quantization levels. The intensity quantization levels were drastically affected by the change in boundaries, as a change of even a few voxels in the delineation can alter the quantization process and hence the higher-order features. We observed a greater degree of correlation for T1w than for T2w higher-order features because T2w has more variability in texture than T1w, and hence even small perturbations had a greater impact on the quantization process. Reducing the number of bins would result in a higher correlation of radiomic features between STAPLE and D-R2UNet, but would fail to capture the dynamic texture of the tumor.
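The quantization effect described above can be sketched with a fixed-bin-count discretization. The ROI intensities are hypothetical; the point is that adding a few bright boundary voxels stretches the bin edges and relabels interior voxels, which propagates into the texture matrices:

```python
import numpy as np

def quantize(intensities, n_bins=8):
    """Fixed-bin-count quantization: map intensities to discrete gray
    levels between the region's min and max, as is commonly done
    before computing higher-order (textural) features."""
    intensities = np.asarray(intensities, dtype=float)
    lo, hi = intensities.min(), intensities.max()
    levels = np.floor((intensities - lo) / (hi - lo) * n_bins).astype(int)
    return np.clip(levels, 0, n_bins - 1)

# Hypothetical ROI intensities; a boundary perturbation adds two
# brighter voxels, changing the min-max range used for binning.
roi = np.array([10., 12., 14., 16., 18., 20.])
roi_perturbed = np.append(roi, [30., 32.])
q1 = quantize(roi)
q2 = quantize(roi_perturbed)[:len(roi)]  # same interior voxels, new labels
```

Even though the six interior voxels are unchanged, their gray-level labels shift once the range widens, illustrating why higher-order features are less reproducible than first-order ones.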
Reproducibility and repeatability are essential elements to enhance the translation of radiomics to clinical practice [48]. Manual delineations are particularly prone to reproducibility issues [49–51]. Haarburger et al. compiled a set of robust features by analyzing reproducibility of features for both manual segmentation and probabilistic automated segmentation of clinical CT images [52]. Zwanenburg et al. assessed the robustness of radiomic features to image perturbations associated with test-retest measures [53]. The CCC [41,42,54–56] is widely used as a metric to quantify reproducibility of features. A suitable threshold value has not yet been established, as different studies have used different thresholds. A CCC > 0.75 is indicative of good reliability between two measurements [57,58]. We selected a stricter threshold of CCC > 0.9 to signify reproducibility [41,42], owing to our small number of data points, to avoid Type I and Type II errors. D-R2UNet showed more consistent results, with a minimal volume bias of 1.02% under test-retest measures, relative to the STAPLE algorithm, which had a volume bias of 2.8%. Even though the volume bias was minimal, due to the sensitivity of the features, the number of reproducible features varied widely between the STAPLE and D-R2UNet algorithms. We also observed a greater number of reproducible features for T1w than for T2w with D-R2UNet because of its less heterogeneous texture.
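Lin's CCC, the reproducibility metric used here, penalizes both poor correlation and systematic offsets between two measurements. A minimal sketch on hypothetical values:

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient between two
    measurements of the same quantity (e.g., test vs. retest):
    2*cov(x,y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

a = np.array([1.0, 2.0, 3.0, 4.0])
ccc_same = lin_ccc(a, a)        # perfect agreement
ccc_shift = lin_ccc(a, a + 1.0) # constant offset, Pearson r still 1
```

Unlike the Pearson correlation, the constant-offset case is penalized (CCC = 2.5/3.5 ≈ 0.71 here), which is why CCC is preferred for test-retest agreement.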
Since variability in the delineation of tumor boundaries is inevitable, we attempted to characterize features that are less sensitive to perturbations in tumor boundaries. Among these robust features, we further identified features that were highly correlated with the STAPLE maps (ρ ≥ 0.9, p ≤ 0.05) and were also reproducible for D-R2UNet (CCC ≥ 0.9, p ≤ 0.05) (Supplementary Material 4). We found that 56 T1w features (36.16%) and 20 T2w features (13.9%) fit all criteria. These groups of features can be used as biomarker indicators in studying treatment response in the preclinical setting using the DL pipeline.
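The combined selection criteria can be expressed as a simple filter. The feature names and statistics below are purely illustrative placeholders, not values from the study:

```python
# Hypothetical per-feature statistics; names and numbers are illustrative only.
stats = {
    "shape_volume":   {"rho": 0.98, "ccc": 0.95, "scc_robust": True},
    "glcm_contrast":  {"rho": 0.85, "ccc": 0.92, "scc_robust": True},
    "glrlm_feature":  {"rho": 0.95, "ccc": 0.93, "scc_robust": False},
    "mean_intensity": {"rho": 0.97, "ccc": 0.96, "scc_robust": True},
}

def candidate_biomarkers(stats, rho_min=0.9, ccc_min=0.9):
    """Keep features that are robust to boundary perturbations (|SCC| < 0.4),
    highly correlated with STAPLE-derived values (rho >= 0.9), and
    reproducible on test-retest (CCC >= 0.9)."""
    return sorted(
        name for name, s in stats.items()
        if s["scc_robust"] and s["rho"] >= rho_min and s["ccc"] >= ccc_min
    )

selected = candidate_biomarkers(stats)
```

Each criterion removes a different failure mode: boundary sensitivity, disagreement with the consensus ground truth, and poor repeatability.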

Conclusions
In conclusion, we have implemented and tested DL-based pipelines for accurate and automatic localization of TNBC PDX tumors in multi-contrast small-animal MR imaging. D-R2UNet performed marginally better than the other implemented networks. Nevertheless, the automated methods ensure high-throughput tumor segmentation and minimize manual intervention, which in turn enhances reproducibility. Furthermore, we have implemented a radiomics pipeline to characterize the sensitivity of the features to perturbations in the tumor boundary. The automatically generated maps were found to be highly correlated and reproducible relative to the STAPLE maps and thus can be used for high-throughput phenotyping of preclinical MR images in co-clinical trials.